
Node v.11.2 died after 3h #780

Closed
stefonarch opened this issue Apr 4, 2018 · 30 comments

Comments

@stefonarch

Description of bug:
Node just died without messages in the log after running for 3 hours. Compiled from git master; commit 84a6b51

Additional information you deem important (e.g. issue happens only occasionally):
System load seems significantly higher than with 11.1.1, sometimes 90%-180% in top; I will attach a graph in a few hours once it is more meaningful.

Environment:
Arch Linux server, 4 GB RAM, SSD
CPU(s): 2 single-core Intel Xeon E5-2680 v2 (-MT-SMP-), cache: 32768 KB
clock speeds: max: 2799 MHz, 1: 2799 MHz, 2: 2799 MHz

RaiBlocks folder stored in a ramdisk.
Node Monitor: https://nanode21.cloud/stats.php

Logs:
https://nanode21.cloud/11.2.log
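The ramdisk is a tmpfs mount along these lines (the size and mount point here are illustrative, not my exact setup):

sudo mount -t tmpfs -o size=2g tmpfs /home/nano/RaiBlocks

Note that the ledger then lives in RAM and competes with the node process for the same 4 GB.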

@stefonarch
Author

The graph has to be taken with a grain of salt, as I ran some RPC scripts that lasted a few minutes, but CPU usage is higher than before, and so is traffic. Updated to v11.2 at around 10:00, crash at 13:00.

[graphs attached: "ipv6", "tokyo"]

@dbachm123

@stefonarch Could you chime in on the discussion in #743? What's your bandwidth data like after updating to v11.1 and v11.2?

@stefonarch
Author

Updated another node to v11.2; it did exactly the same after a few hours, with no useful info in the log. But CPU usage is definitely higher.
[graph attached: "ubuntuffcpu"]

@dbachm123

Weird. My node has been running smoothly after the 11.2 upgrade. No crashes or problems whatsoever.

@dbachm123

dbachm123 commented Apr 7, 2018

I guess I've jinxed it - my node (11.2) crashed earlier today as well. No particularly relevant info in the logs :(

@clemahieu
Contributor

Anything in dmesg logs?

@dbachm123

dbachm123 commented Apr 7, 2018

Maybe:

dmesg | grep rai

[286378.128376] rai_node invoked oom-killer: gfp_mask=0x2420848, order=0, oom_score_adj=0
[286378.128382] rai_node cpuset=/ mems_allowed=0
[286378.128393] CPU: 0 PID: 15415 Comm: rai_node Not tainted 4.4.0-116-generic #140-Ubuntu
[286378.128919] [15405]  1001 15405 268778433   193574    1756       7        0             0 rai_node
[286378.128931] Out of memory: Kill process 15405 (rai_node) score 769 or sacrifice child
[286378.132356] Killed process 15405 (rai_node) total-vm:1075113732kB, anon-rss:774296kB, file-rss:0kB

This is on a 1GB RAM VPS that used to run all previous versions perfectly well. Memory usage doesn't look that much different with v11.2 compared to v11.1 and previous versions. Also, @stefonarch runs the node on machines with a bit more RAM.

@cryptocode
Contributor

@stefonarch did the OOM killer take down yours as well? 1GB is awfully tight, and so is 4GB if running with a ramdisk. Does it fare better if not using tmpfs?

@dbachm123

dbachm123 commented Apr 7, 2018

I've just upgraded to a 2GB VPS - will continue to monitor --> http://138.197.179.164/

Here's a CPU / memory log. The node was running OK for several days, then CPU usage and memory started to spike, which led to the OOM crash (~6pm).


@dbachm123

On http://wehavethetechnology.io there is the following comment: "Nodes seem to be getting behind/crashing when unchecked blocks are flushed." I have indeed flushed unchecked blocks yesterday via RPC. Maybe that gives a hint towards the problem.

@oFLIPSTARo

oFLIPSTARo commented Apr 7, 2018

@dbachm123 That one is my node; it's happened twice to me. Basically, I would see an "Unchecked blocks flushed." message but would not get the normal confirmation that it completed. The node can still connect with other reps, but it does not check any blocks. Not long after, the node shuts down with a broken pipe message.

I haven't had any problems since rebooting the system and restarting the node. I've seen one other person with the same log error; I think it was Prometheus on Discord. For everyone else, it seems like their resources get used up and then the node crashes. I haven't seen their logs, but it seems similar in that the node stays responsive to RPC commands yet will not vote or check blocks.

@dbachm123

Also, the list of trusted reps at https://nanode21.cloud/representatives.php shows many offline nodes. And I haven’t seen any of those nodes offline before.


@stefonarch
Author

stefonarch commented Apr 8, 2018

Ehm, there was a bug until 2 days ago that always showed green dots... but yes, nodes need at least a crontab watching them.

Personally, I haven't seen any more crashes on any of my 3 nodes.

EDIT: oops... 1 GB, 1 core, 11.2:

$ cat node.log 
Np RPC response, Node restartet at Fri  6 April  20:33
Np RPC response, Node restartet at Sat  7 April  02:00
Np RPC response, Node restartet at Sat  7 April  02:30
Np RPC response, Node restartet at Sat  7 April  21:30

@stefonarch
Author

stefonarch commented Apr 8, 2018

@cryptocode OOM (out of memory?) killer? Never.
My nodes run with 1 GB, 2 GB, and 4 GB of RAM respectively, the last with the ledger in a ramdisk. The 1 GB node usually runs fine too, but it is not a representative. No Docker at all, just releases.

@cryptocode
Contributor

@stefonarch ok, dbachm123 had an oom-killer entry in his dmesg logs, figured I'd ask.

@dbachm123

Thanks @stefonarch
I have set up a watchdog that checks RPC availability and restarts the node if RPC calls do not go through...
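For illustration, a minimal shell sketch of such a watchdog (not the actual script; the RPC address, the container name "nano", and the log path are all assumptions):

#!/bin/sh
# Probe the node's RPC interface; restart the node if there is no answer.
RPC_URL="http://127.0.0.1:7076"   # default RPC address (assumed)
RESPONSE=$(curl -s -m 10 -d '{"action": "block_count"}' "$RPC_URL")
case "$RESPONSE" in
  *count*)
    ;;   # RPC answered, node looks alive
  *)
    echo "No RPC response, node restarted at $(date)" >> "$HOME/node.log"
    docker restart nano   # replace with your own start command
    ;;
esac

Invoked from cron every few minutes, it produces a restart log like the one shown above.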

@Joohansson

Joohansson commented Apr 9, 2018

The problem started after I flushed/cleared the 20,000+ unchecked blocks I had accumulated since January. Could that be the reason?
Now I restart the Docker node every hour to keep it behaving smoothly; otherwise it stops voting and eventually stops responding.

@tmchow

tmchow commented Apr 10, 2018

I’m having the same issue. Upgraded to 11.2, and periodically my node will crash and die. I issue a docker restart and everything is OK for an indeterminate amount of time.

I’m hosted on a DigitalOcean VPS on Ubuntu with 1 GB of RAM.

Releases prior to this ran fine, including 11.0.

@BeeChains

BeeChains commented Apr 10, 2018 via email

@tmchow

tmchow commented Apr 10, 2018

@BeeChains Wrong place to post that question. You cannot run a node on a mobile device. You need to host it on a server, like a Digital Ocean VPS: https://medium.com/@seanomlor/how-to-run-your-own-raiblocks-node-on-digitalocean-6a5a2492c29b

@tmchow

tmchow commented Apr 10, 2018

I've seen several mentions of flushing unchecked blocks, but I can't find out how to do it. For some reason I have 20k unchecked blocks, and the count stays at that value consistently. Other nodes have far fewer, so I'm assuming something is wrong.

@oFLIPSTARo

@tmchow you send an "unchecked_clear" RPC command. AFAIK that is not the same as the node's own flushing of unchecked blocks. I would restart the node and see if that fixes your issue.
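For reference, a minimal way to send that command with curl, assuming RPC is enabled in config.json ("rpc_enable": "true") and listening on the default port 7076:

curl -s -d '{"action": "unchecked_clear"}' http://127.0.0.1:7076

A successful call should return {"success": ""}. If the node runs in Docker, the same call works from the host as long as the container publishes the RPC port (e.g. -p 127.0.0.1:7076:7076).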

@tmchow

tmchow commented Apr 12, 2018

@oFLIPSTARo I hit this again... looks like my node crashed this morning despite the earlier restart. It was offline for about 8 hours. I've just restarted the container.

I don't know how to issue an RPC command to the container. Are there some simple steps you can outline?

@Joohansson

Yeah, same problem for me. I have a cronjob restarting the node every 30 min because every 60 min was too long!
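As a sketch, such a crontab entry could look like this (the container name "nano" is an assumption):

# restart the node container every 30 minutes
*/30 * * * * docker restart nano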

@dbachm123

dbachm123 commented Apr 13, 2018

Here's a very rough version of my watchdog scripts, which run as a cronjob. They won't work out of the box due to some hardcoded paths, but they might give you a starting point for watching the node.

https://github.com/dbachm123/nanoNodeScripts

@Joohansson

Joohansson commented Apr 15, 2018

@dbachm123 Thanks, I'm running your code now. Works great!
Before running the script from crontab you also need to do a "chmod u+x nanoNodeWatchDog.py".
I'm running a Docker node, so my start script is just "docker restart nano".

@meltingice

Just chiming in here with some strange activity. My node crashed and restarted last night, and when it did, it looks like it lost about 200k blocks from the reported block count. It's no longer catching up either.

@dbachm123

Commit f749697 is running very smoothly on my rep node. All previously observed issues are gone 👍

@NiFNi
Contributor

NiFNi commented May 14, 2018

No issues for some time now. Close this?

@stefonarch
Author

Overcome by newer versions...
