Nodes are crashing, out of memory #40

Closed
dismantl opened this issue Oct 1, 2013 · 10 comments

@dismantl
Contributor

dismantl commented Oct 1, 2013

Picostations flashed with DR2 seem to crash periodically due to out-of-memory conditions (nodes have ~700K of free memory right before a crash, as reported by top, and no other cause for the crashes has been found). I have not been able to reliably reproduce the crash conditions, however, and we do not currently have a method for accurately measuring per-process memory usage. Both Serval and Nodogsplash report high memory usage, though.
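
One rough way to compare per-process memory is to read the VmRSS field from each /proc/<pid>/status file. The sketch below is only an approximation: VmRSS counts shared pages in every process that maps them, so it overstates real usage. The function name `rss_by_process` and its optional directory argument are hypothetical, added so the function can be pointed at a test tree.

```shell
#!/bin/sh
# List processes by resident set size (VmRSS, in kB), largest first.
# The optional first argument is a hypothetical override of the proc
# directory; on a node it defaults to /proc.
rss_by_process() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # Skip entries that vanished or did not match the glob.
        [ -r "$status" ] || continue
        awk '
            /^Name:/  { name = $2 }
            /^VmRSS:/ { rss = $2 }
            # Kernel threads have no VmRSS line and are skipped.
            END       { if (rss != "") printf "%d %s\n", rss, name }
        ' "$status"
    done | sort -rn
}

# Usage on a node: rss_by_process | head
```

Running this periodically from the serial console or cron and comparing the top entries over time may show which daemon is growing.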

@dismantl
Contributor Author

dismantl commented Oct 1, 2013

@hawkinswnaf made the good suggestion of using a serial console to monitor dmesg output and catch OOM messages right before a crash.
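
A minimal sketch of filtering the kernel log for OOM-killer activity follows. The function name `oom_lines` is hypothetical, and the patterns are an assumption based on common Linux kernel messages ("invoked oom-killer", "Out of memory: Kill process ..."); exact wording varies between kernel versions, so the pattern may need adjusting.

```shell
#!/bin/sh
# Read kernel log text on stdin and print only the lines that look like
# OOM-killer activity. The pattern list is an assumption, not exhaustive.
oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Usage on a node (over the serial console): dmesg | oom_lines
```

Because the node may reboot right after the OOM event, capturing the serial console output on the attached machine is what preserves these lines.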

@critzo

critzo commented Oct 4, 2013

Additional testing has confirmed this issue is likely due to high memory usage in the serval daemon.
To stabilize DR2 routers, stop and disable servald:

$ /etc/init.d/serval-dna stop
$ /etc/init.d/serval-dna disable
$ killall servald

@westbywest
Collaborator

Likewise, following a brief chat with Dan here, I suggested adding the "zram-swap" package presently in OpenWRT trunk to the commotion packages feed, and then enabling swap in the kernel config. This would let you enable compressed swap memory on nodes, and ideally make the memory limit somewhat softer (i.e. help nodes avoid OOM errors and crashing processes).

So, specifically, do make kernel_menuconfig and make these selections:
General Setup -> Support for paging of anonymous memory (swap) *
Device Drivers -> Staging drivers -> Compressed RAM block device support *

This kernel config change can also be made via a patch, and such a patch is buried somewhere in the openwrt-devel listserv archives (from when the zram package was originally announced).

Then copy the zram-swap package from trunk into commotionfeed and enable it.

For my nodes with 32MB of RAM, I specify 6MB swap in /etc/config/system:

config system
    ...
    option 'zram_size_mb' '6'

You can periodically check swap usage to ensure nothing is using excessive RAM:

root@nsm5-b:~# free
             total         used         free       shared      buffers
Mem:         29184        24100         5084            0         2752
-/+ buffers:              21348         7836
Swap:         6140            0         6140
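
To make that periodic check scriptable, the Swap line of the `free` output can be reduced to two numbers. This is a sketch that assumes the column layout shown above (total, used, free); the function name `swap_usage` is hypothetical.

```shell
#!/bin/sh
# Read `free` output on stdin and print "<used_kB> <used_percent>" for swap.
# Assumes the Swap line has the columns: total, used, free.
swap_usage() {
    awk '/^Swap:/ {
        # Guard against division by zero when no swap is configured.
        pct = ($2 > 0) ? int($3 * 100 / $2) : 0
        printf "%d %d\n", $3, pct
    }'
}

# Usage on a node: free | swap_usage
```

A steadily rising second number would suggest that some process is being pushed into swap and is worth investigating.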

@hawkinswnaf
Collaborator

Now THIS is cool. I can't wait to try it!

@westbywest
Collaborator

I am running zram-swap as described above on WasabiNet nodes, both 5.8GHz mesh backhaul and 2.4GHz mesh APs, and I can confirm that certain processes do not like to be swapped out, freezing or behaving erratically as a result. hostapd, wpa_supplicant, olsrd, crond/busybox, and whatever your captive portal agent is (nodogsplash or coovachilli) certainly shouldn't be swapped out. Possibly commotiond too, although I've not had the opportunity to test that.

So, the zram-swap method described above is somewhat lacking in robustness. I think mlock (the mlock()/mlockall() system calls) can be used to prevent specific processes from being swapped out, although I'm uncertain whether OpenWRT has a tool for this integrated.
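
To see whether one of these daemons has actually been swapped out, the VmSwap field in /proc/<pid>/status can be read. This is a sketch that assumes the node's kernel exposes VmSwap (older kernels do not); the function name `swapped_kb` is hypothetical.

```shell
#!/bin/sh
# Read a /proc/<pid>/status file on stdin and print "<name> <swapped_kB>".
# Prints nothing if the kernel does not report VmSwap for the process.
swapped_kb() {
    awk '
        /^Name:/   { name = $2 }
        /^VmSwap:/ { swap = $2 }
        END        { if (swap != "") printf "%s %d\n", name, swap }
    '
}

# Usage on a node, for each daemon of interest:
#   for p in $(pidof hostapd); do swapped_kb < "/proc/$p/status"; done
```

A non-zero value for hostapd, wpa_supplicant, or olsrd would line up with the freezing described above.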

@ghost assigned dismantl Oct 31, 2013

@areynold
Collaborator

areynold commented Mar 6, 2014

@andygunn Can we get the Detroit folks to review this issue for 1.1 using @elationfoundation's crashlog scripts?

@westbywest
Collaborator

To follow up here, I've seen the hostapd and/or wpa_supplicant processes on my Wasabi Nanostation M2 nodes occasionally crash under heavy load, and the crashes do appear to be closely tied to memory exhaustion caused by some misbehaving process. This is happening even with 3MB of zram swap, and furthermore even with vm.swappiness=0 specified in /etc/sysctl.conf. That said, there are noticeably fewer crashes with swappiness turned all the way down.

When the hostapd or wpa_supplicant process crashes, you will see whichever wireless VIF that process manages (e.g. the mesh adhoc VIF, the private AP) become unresponsive, even though the SSID remains visible. Note that the process names "hostapd" and "wpa_supplicant" appear in the process table regardless of whether you're using the wpad or wpad-mini package.

You can verify such crashes with the presence of files like these in /tmp:
/tmp/wpa_supplicant.1497.11.1393700006.core
/tmp/hostapd.1336.11.1384486446.core
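
The file names above can be split into their parts with a small helper. This sketch assumes the names follow a core pattern of the form name.pid.signal.timestamp.core, which matches the two examples but is not confirmed in this thread; under that assumption the "11" would be the signal number (SIGSEGV on Linux). The function name `describe_core` is hypothetical.

```shell
#!/bin/sh
# Split a core file name of the assumed form
#   <name>.<pid>.<signal>.<unix_time>.core
# into its fields, working from the right so that dots in the
# process name do not confuse the parsing.
describe_core() {
    base=${1##*/}       # strip the directory
    base=${base%.core}  # strip the suffix
    ts=${base##*.};  base=${base%.*}
    sig=${base##*.}; base=${base%.*}
    pid=${base##*.}; name=${base%.*}
    printf '%s pid=%s signal=%s time=%s\n' "$name" "$pid" "$sig" "$ts"
}

# Usage on a node:
#   for f in /tmp/*.core; do [ -e "$f" ] && describe_core "$f"; done
```

Comparing the timestamps against the time a VIF went unresponsive is one way to confirm which crash caused it.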

@andygunn

andygunn commented Mar 6, 2014

@areynold Sure - I will need to pass along instructions to folks in Detroit. Is there a wiki page or an existing set of documentation to test with? Where are the crashlog scripts? @elationfoundation, can you send them to me?

@seamustuohy
Collaborator

My notes on testing for restarting nodes are in the "I think my node is restarting. How can I tell?" section of https://wiki.commotionwireless.net/doku.php?id=development_resources:router:troubleshooting_routers. Following them will create a file on the node that logs whenever the node restarts.

@jheretic
Member

What we're seeing at this point is that there are lots of different memory conditions that can cause a node to crash. While there are still memory leaks we're dealing with, the original leak that this issue was pointing to is way out of date, so I'm closing this out in favor of other, more specific issues.
