Nodes are crashing, out of memory #40

dismantl · 2013-10-01T22:15:34Z

Picostations flashed with DR2 seem to crash periodically due to out-of-memory issues (nodes have ~700K of free memory right before the crash, as reported by top, and no other conditions for the crash have been found). I have not been able to reliably reproduce the crash conditions, however, and we do not currently have a method for accurately measuring process memory usage. Even so, both Serval and Nodogsplash report high memory usage.
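
One rough way to get per-process numbers on a node without extra tooling is to read the VmRSS line from /proc/<pid>/status. The following is only a sketch under assumptions, not something tested in this thread: the helper names rss_kb and report_rss are hypothetical, pidof is assumed to be present (BusyBox normally provides it), and VmRSS counts shared pages once per process that maps them, so it overstates real usage.

```shell
#!/bin/sh
# rss_kb FILE: print the VmRSS value (in kB) from a /proc/<pid>/status-style
# file. Hypothetical helper for illustration; prints nothing when the field
# is absent (kernel threads have no VmRSS line).
rss_kb() {
    awk '$1 == "VmRSS:" { print $2 }' "$1"
}

# report_rss NAME: print resident memory for every running process called NAME.
# Assumes pidof exists on the node.
report_rss() {
    for pid in $(pidof "$1"); do
        echo "$1 pid=$pid rss_kb=$(rss_kb "/proc/$pid/status")"
    done
}
```

Running something like `report_rss servald` every few minutes and logging the result would at least show whether one daemon grows steadily before a crash.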

dismantl · 2013-10-01T22:17:54Z

@hawkinswnaf gave the good suggestion of using a serial console to monitor dmesg output to catch OOM messages right before crash.
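
As a sketch of what to look for in the captured console output: the filter below keeps only the lines the kernel's OOM killer normally prints. The function names are hypothetical, and the exact message wording is an assumption that can vary between kernel versions, so treat the patterns as a starting point.

```shell
#!/bin/sh
# oom_lines: read kernel log text on stdin and print only lines that look
# like OOM-killer output. The patterns are assumptions about message wording.
oom_lines() {
    grep -E 'invoked oom-killer|Out of memory|Killed process'
}

# oom_victims: read kernel log text on stdin and print the name of each
# process reported in a "Killed process <pid> (<name>)" line.
oom_victims() {
    sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p'
}
```

On a live node this could be used as `dmesg | oom_lines`; against a saved serial capture, as `oom_victims < console.log` (the file name is just an example).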

critzo · 2013-10-04T07:59:07Z

Additional testing indicates that this issue is most likely caused by high memory usage in the Serval daemon (servald).
To stabilize DR2 routers, stop and disable servald:

$ /etc/init.d/serval-dna stop
$ /etc/init.d/serval-dna disable
$ killall servald

westbywest · 2013-10-04T08:25:02Z

Likewise, following a brief chat with Dan here, I suggested adding the "zram-swap" package, presently in OpenWRT trunk, to the Commotion packages feed, and then enabling swap in the kernel config. This would let you enable compressed swap memory on nodes and ideally make the memory limit somewhat softer (i.e. help nodes avoid OOM errors and crashing processes).

So, specifically, do make kernel_menuconfig and make these selections:
General Setup -> Support for paging of anonymous memory (swap) *
Device Drivers -> Staging drivers -> Compressed RAM block device support *

This kernel config change can also be done via a patch, and such a patch is buried somewhere in openwrt-devel listserv archives (i.e. when the zram package was originally announced).
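
To double-check that the two selections above actually landed in the kernel configuration, something like the following sketch could be run against the kernel .config in the build tree. It assumes the menu entries map to the symbols CONFIG_SWAP and CONFIG_ZRAM; the function name is hypothetical, and the location of the .config file depends on your build tree.

```shell
#!/bin/sh
# has_zram_swap_config FILE: succeed (exit 0) if the kernel .config-style
# FILE enables swap (CONFIG_SWAP=y) and zram (CONFIG_ZRAM built in or as a
# module). Symbol names are assumed to match the menu entries listed above.
has_zram_swap_config() {
    grep -q '^CONFIG_SWAP=y' "$1" &&
        grep -Eq '^CONFIG_ZRAM=(y|m)' "$1"
}
```

For example: `has_zram_swap_config path/to/.config && echo "swap + zram enabled"`.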

Then copy the zram-swap package from trunk into commotionfeed and enable it.

For my nodes with 32MB of RAM, I specify 6MB swap in /etc/config/system:

config system
    ...
    option 'zram_size_mb' '6'

You can periodically check swap usage to ensure nothing is using excessive RAM:

root@nsm5-b:~# free
             total         used         free       shared      buffers
Mem:         29184        24100         5084            0         2752
-/+ buffers:              21348         7836
Swap:         6140            0         6140
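
If you want to log this rather than eyeball it, the Swap: row can be pulled out of the free output with awk. This is a minimal sketch assuming the BusyBox free layout shown above (total, used, free in kB); the function names are hypothetical.

```shell
#!/bin/sh
# swap_used_kb: read `free` output on stdin and print the "used" column (kB)
# of the Swap: row.
swap_used_kb() {
    awk '$1 == "Swap:" { print $3 }'
}

# swap_used_pct: read `free` output on stdin and print used swap as an
# integer percentage of total swap (truncated). Prints nothing when there is
# no Swap: row or when total swap is 0.
swap_used_pct() {
    awk '$1 == "Swap:" && $2 > 0 { printf "%d\n", ($3 * 100) / $2 }'
}
```

Usage would be `free | swap_used_pct`, for instance from a cron job that appends the value to a log file.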

hawkinswnaf · 2013-10-04T15:59:43Z

Now THIS is cool. I can't wait to try it!

westbywest · 2013-10-21T16:54:50Z

I am running zram-swap as described above on WasabiNet nodes, both 5.8GHz mesh backhaul and 2.4GHz mesh APs, and I can confirm that certain processes do not like to be swapped out, freezing or behaving erratically as a result. hostapd, wpa_supplicant, olsrd, crond/busybox, and whatever your captive portal agent is (nodogsplash or coovachilli), all certainly shouldn't be swapped out. Possibly commotiond too, although I've not had opportunity to test that.

So, the zram-swap method described above is not fully robust. I think mlock (strictly a system call that a process makes on its own memory, rather than a command) can be used to prevent specific processes from being swapped out, although I'm uncertain whether OpenWRT has any tool for this integrated.

areynold · 2014-03-06T18:20:54Z

@andygunn Can we get the Detroit folks to review this issue for 1.1 using @elationfoundation's crashlog scripts?

westbywest · 2014-03-06T18:37:14Z

To follow up here, I've seen the hostapd and/or wpa_supplicant processes on my Wasabi Nanostation M2 nodes occasionally crash under heavy load, and it does look to be closely coupled with memory exhaustion from some misbehaving process. This is happening even with 3MB of zram swap, and furthermore even with vm.swappiness=0 specified in /etc/sysctl.conf, although there are noticeably fewer crashes with swappiness turned all the way down.

When the hostapd or wpa_supplicant processes crash, you will see whichever wireless VIF that process manages (e.g. the mesh adhoc VIF, the private AP) become unresponsive, even though the SSID remains visible. Note that the process names "hostapd" and "wpa_supplicant" appear in the process table regardless of whether you're using the wpad or wpad-mini packages.

You can verify such crashes with the presence of files like these in /tmp:
/tmp/wpa_supplicant.1497.11.1393700006.core
/tmp/hostapd.1336.11.1384486446.core
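
To summarize such files quickly, the names can be split into fields. This sketch assumes the name layout is <process>.<pid>.<signal>.<unix-time>.core, which is my reading of the two examples above rather than something confirmed in this thread; the function name is hypothetical. Under that assumption, the 11 in both names would be the signal number (11 is SIGSEGV on Linux).

```shell
#!/bin/sh
# parse_core_name PATH: split a core file name of the ASSUMED form
#   <process>.<pid>.<signal>.<unix-time>.core
# into labelled fields. Fields are peeled off from the right so that process
# names containing dots are still handled.
parse_core_name() {
    base=${1##*/}                        # strip the directory part
    base=${base%.core}                   # strip the .core suffix
    ts=${base##*.};   base=${base%.*}    # last field: timestamp
    sig=${base##*.};  base=${base%.*}    # next: signal number
    pid=${base##*.};  name=${base%.*}    # next: pid; remainder: process name
    echo "name=$name pid=$pid signal=$sig time=$ts"
}
```

For example: `for f in /tmp/*.core; do [ -e "$f" ] && parse_core_name "$f"; done`.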

andygunn · 2014-03-06T18:38:03Z

@areynold Sure. I will need to pass along instructions to folks in Detroit; is there a wiki page or an existing set of documentation to test with? Where are the crashlog scripts? @elationfoundation, can you send them to me?

seamustuohy · 2014-03-06T18:58:42Z

I have the notes for testing whether nodes are restarting in the "I think my node is restarting. How can I tell?" section of https://wiki.commotionwireless.net/doku.php?id=development_resources:router:troubleshooting_routers. Following them will create a file on the node that logs whenever it restarts.

jheretic · 2014-04-16T14:44:51Z

What we're seeing at this point is that there are lots of different memory conditions that can cause a node to crash. While there are still memory leaks we're dealing with, the original leak that this issue was pointing to is way out of date, so I'm closing this out in favor of other, more specific issues.

ghost assigned dismantl Oct 31, 2013

jheretic closed this as completed Apr 16, 2014
