ESP32: Memory leak when continuously sending data to client #2014

Closed
devsaurus opened this issue Jun 25, 2017 · 15 comments
Comments

@devsaurus
Member

Expected behavior

I have some server scripts that send data to a client. They run without memory problems on the current esp8266 firmware.

Actual behavior

When running the same code on dev-esp32, the heap decreases constantly and at some point the wifi disconnects from the AP. The symptoms are similar to what @LDmitryL reported in #1970 (comment), but I don't think this is actually related to #1970, for the following reasons:

  • Client does not send data right after connect/accept.
  • The leak happens during a persistent connection, no disconnect/reconnect involved whatsoever.

Test code

I tried to isolate the problem, but can't reproduce it in any artificial test case other than my vnc server framework. I've set up a debug branch which will output some diagnostic messages at https://github.com/devsaurus/nodemcu-vncserver/tree/debug.

Steps to reproduce:

  • Required modules: wifi, net, bit, struct.
  • Upload vncserver.lua and rectangles.lua to ESP32.
  • Connect to AP.
  • dofile("rectangles.lua")
  • Start vnc client on a PC or smartphone and connect to ESP32:
    vncviewer <ip_of_esp32>
  • Observe decreasing heap while vnc window is continuously updated from ESP32.
dofile("rectangles.lua")
> client connected
start draw heap: 210552
end draw heap: 208588

start draw heap: 208700
end draw heap: 208324

start draw heap: 206792
end draw heap: 206780

start draw heap: 205096
end draw heap: 204980

start draw heap: 203344
end draw heap: 203248

start draw heap: 201600
end draw heap: 201472

start draw heap: 199820
end draw heap: 199732

start draw heap: 198076
end draw heap: 197988

start draw heap: 196332
end draw heap: 196220

start draw heap: 194576
end draw heap: 194468

start draw heap: 192812
end draw heap: 192716

start draw heap: 191068
end draw heap: 190972

start draw heap: 189312
end draw heap: 189216

start draw heap: 187564
end draw heap: 187468

I (12269) wifi: pm start, type:0

start draw heap: 185796
end draw heap: 185700

client disconnected
client disconnected

<I stopped vncviewer on the PC and started it again>

client connected
start draw heap: 178936
end draw heap: 176936

start draw heap: 177064
end draw heap: 176696

start draw heap: 175176
end draw heap: 175148

start draw heap: 173484
end draw heap: 173372

start draw heap: 171724
end draw heap: 171604

start draw heap: 168372
end draw heap: 168096

start draw heap: 166436
end draw heap: 166348

start draw heap: 164688
end draw heap: 164592

start draw heap: 162932
end draw heap: 162852

start draw heap: 160952
end draw heap: 161088

I (25309) wifi: bcn_timout,ap_probe_send_start
I (27819) wifi: ap_probe_send over, resett wifi status to disassoc
I (27819) wifi: state: run -> init (1)
I (27819) wifi: pm stop, total sleep time: 0/15542576

I (27819) wifi: n:9 0, o:9 0, ap:255 255, sta:9 0, prof:1
client disconnected
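
To put a number on the leak, the "start draw heap" values from the first connection in the log above can be compared directly. The following is only a small sketch of that arithmetic; the function names are made up for illustration and are not part of the firmware.

```c
#include <assert.h>
#include <stddef.h>

/* "start draw heap" values copied from the first connection in the log. */
static const long start_heap[] = {
    210552, 208700, 206792, 205096, 203344, 201600, 199820, 198076,
    196332, 194576, 192812, 191068, 189312, 187564, 185796
};
static const size_t n_samples = sizeof start_heap / sizeof start_heap[0];

/* Total bytes lost between the first and the last sample. */
long total_drop(const long *heap, size_t n)
{
    assert(n >= 2);
    return heap[0] - heap[n - 1];
}

/* Average bytes lost per draw cycle (integer division, rounds down). */
long avg_drop_per_cycle(const long *heap, size_t n)
{
    assert(n >= 2);
    return total_drop(heap, n) / (long)(n - 1);
}
```

For the first session this gives a drop of 24756 bytes over 14 draw cycles, i.e. roughly 1.7 KB lost per cycle.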

NodeMCU version

dev-esp32 branch.

Hardware

ESP32 board.

@jmattsson
Member

This is likely due to multithreading issues and races. For the ESP32 we will need to rewrite the net module to not use the raw lwIP api.

@devsaurus
Member Author

> This is likely due to multithreading issues and races.

Are these issues specific to our integration of esp-idf, or should/could we raise this upstream with a more generic statement?

They're discussing the symptom I (25309) wifi: bcn_timout,ap_probe_send_start in espressif/esp-idf#724

@jmattsson
Member

They're specific to our net module in the ESP32 branch. Upon revising the rawapi.txt file it states very clearly that none of the rawapi functions may be used outside the tcpip thread, as there is no locking in the core lwIP. Of course, we use them exclusively outside the tcpip thread... >.<

@devsaurus
Member Author

> Upon revising the rawapi.txt file it states very clearly that none of the rawapi functions may be used outside the tcpip thread

Agree, the statements there are pretty clear http://git.savannah.nongnu.org/cgit/lwip.git/tree/doc/rawapi.txt#n31

Which options do we have now?

  • Switch over to the Netconn API?
    But this one's synchronous with blocking and stuff. Can't imagine how that would fit our net API.
  • Route all raw functionality through the "tcpip_thread" by means of tcpip_callback_with_block()?
    That would involve some major refactoring of the net code IMO, but might be a way out.
  • Others...?
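
For the second option, the general idea is that application code never touches connection state directly. It only posts a request, and the thread that owns the state executes it. Below is a minimal single-threaded model of that pattern, not lwIP code: in lwIP the posting would go through tcpip_callback_with_block() and the jobs would run in the tcpip thread. All names here (mbox_post, mbox_drain, conn_state, queue_100_bytes) are hypothetical, and a real mailbox would have to be thread-safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for state owned by the network thread (e.g. a PCB). */
struct conn_state {
    int bytes_queued;
};

typedef void (*owner_fn)(void *arg);

struct job {
    owner_fn fn;
    void *arg;
};

#define MBOX_SIZE 8

/* Fixed-size FIFO of pending jobs. This model is single-threaded; a real
 * implementation would use a lock or the RTOS mailbox instead. */
struct mbox {
    struct job jobs[MBOX_SIZE];
    size_t head;   /* index of the next job to run */
    size_t count;  /* number of pending jobs */
};

/* Application side: request that fn(arg) runs in the owner context.
 * Returns 0 on success, -1 if the mailbox is full. */
int mbox_post(struct mbox *mb, owner_fn fn, void *arg)
{
    assert(mb != NULL && fn != NULL);
    if (mb->count == MBOX_SIZE)
        return -1;
    size_t tail = (mb->head + mb->count) % MBOX_SIZE;
    mb->jobs[tail].fn = fn;
    mb->jobs[tail].arg = arg;
    mb->count++;
    return 0;
}

/* Owner side only: run all pending jobs in FIFO order.
 * Returns the number of jobs executed. */
size_t mbox_drain(struct mbox *mb)
{
    size_t ran = 0;
    while (mb->count > 0) {
        struct job j = mb->jobs[mb->head];
        mb->head = (mb->head + 1) % MBOX_SIZE;
        mb->count--;
        j.fn(j.arg);
        ran++;
    }
    return ran;
}

/* Example job: the only place where conn_state is modified. */
void queue_100_bytes(void *arg)
{
    struct conn_state *c = arg;
    c->bytes_queued += 100;
}
```

Posting a job leaves conn_state untouched until the owner calls mbox_drain(), which is the property the raw API documentation asks for. The cost is that every net operation has to be expressed as such a job, which is where the refactoring effort would go.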

@jmattsson
Member

The netconn API does have some event callback functionality, and claims to also do nonblocking, so I think it should be possible to use it as the foundation instead.

@devsaurus
Member Author

> The netconn API does have some event callback functionality

Right, there's a callback for the application, missed that. Though it looks quite tailored to the needs of the socket API. Will have a look.
Do you have any commits pending that you plan to push before I take net apart?

@jmattsson
Member

It would probably be worth merging in https://github.com/jmattsson/nodemcu-firmware/tree/esp32-net-accept-race as a band-aid in the interim. I also wanted to change from tcp_abort() to tcp_close() in the socket:close() function, but I haven't had a chance to sit down and do/test that.

@devsaurus
Member Author

It took me some time to get my head around lwip and netconn, but there's a working prototype now at https://github.com/devsaurus/nodemcu-firmware/tree/esp32-netconn.
It's a work in progress, about halfway through, with only the TCP client and server functionality ported over to netconn (UDP is still on the raw API), but it shows where the journey leads. There are a lot of #ifndefs because I port things incrementally and switch back to the raw API for immediate comparison.

And for the good news: It passes my server stress test with

  • no memory leaks!
  • no wifi hiccups or sudden drop-outs!
  • better networking performance!

Reading through the lwip code I found several statements related to the general accept & receive race we also have in #1970. Not sure yet how this corner case is addressed exactly on netconn level and whether we'd benefit instantly, but I'll give it a try as soon as I have a suitable test case setup.

@jmattsson
Member

Great effort, cheers!

@LDmitryL

LDmitryL commented Aug 2, 2017

@devsaurus I pulled the latest changes for dev-esp32, and after make all I get these error messages:

CC spi_master.o
CC u8g2.o
C:/msys32/home/LDL/nodemcu-firmware/components/modules/u8g2.c:581:12: error: 'ldisplay_i2c' defined but not used [-Werror=unused-function]
static int ldisplay_i2c( lua_State *L, setup_fn display_setup_fn_t )
           ^
C:/msys32/home/LDL/nodemcu-firmware/components/modules/u8g2.c:645:12: error: 'ldisplay_spi' defined but not used [-Werror=unused-function]
static int ldisplay_spi( lua_State *L, setup_fn display_setup_fn_t )
           ^
cc1.exe: some warnings being treated as errors
make[1]: *** [/home/LDL/nodemcu-firmware/sdk/esp32-esp-idf/make/component_wrapper.mk:211: u8g2.o] Error 1
make: *** [c:/msys32/home/LDL/nodemcu-firmware/sdk/esp32-esp-idf//make/project.mk:387: module-build] Error 2

What happened?

@LDmitryL

LDmitryL commented Aug 2, 2017

@devsaurus If I enable the U8G2 module, everything compiles without errors. But I don't need it in this project.
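
For context, GCC reports -Wunused-function when a static function is defined but never referenced in its translation unit, and -Werror turns that warning into a build failure. This typically shows up when the only callers sit behind a configuration switch that is turned off. The sketch below illustrates one common way to avoid it, by compiling the helper under the same switch as its callers. It is a generic illustration with hypothetical names (FEATURE_DISPLAY_I2C, display_setup_helper) and makes no claim about how u8g2.c is actually structured.

```c
#include <assert.h>

/* Hypothetical feature switch standing in for a module's config option.
 * Uncomment to build the display code. */
/* #define FEATURE_DISPLAY_I2C */

#ifdef FEATURE_DISPLAY_I2C
/* The static helper is compiled only when its caller is compiled too,
 * so it can never end up "defined but not used". */
static int display_setup_helper(int addr)
{
    return addr & 0x7f;
}

int display_i2c_setup(int addr)
{
    return display_setup_helper(addr);
}
#endif

/* Reports how many optional display features were compiled in. */
int feature_count(void)
{
#ifdef FEATURE_DISPLAY_I2C
    return 1;
#else
    return 0;
#endif
}
```

With the switch left undefined, neither the helper nor its caller is compiled, so the build stays clean even with -Werror.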

@devsaurus
Member Author

Thanks for reporting @LDmitryL. Has been fixed with 4375e09.

@LDmitryL

LDmitryL commented Aug 4, 2017

@devsaurus I began testing my web server with the new API and noticed some peculiar behavior.

  1. At first I sent the data in a loop. With the old API the download took 23 seconds; with the new API it takes less than 5 seconds. I then rewrote the transfer function as recommended (via callback), and the transfer time went back to about the previous value (18 seconds). What could be the reason for the drop in speed?
  2. When files are requested one at a time, there is no problem. But when the requests come from a real browser, there is sometimes a connection reset error. My guess is that while I am passing a file to the browser, another request arrives from the browser and resets the current transfer. I did not run into this with the old API: there, I stay inside the receive event until the transfer is complete and do not respond to new connections. Although it does not happen often, I simulated the situation to check, by sending requests from two browsers at the same time (or by pressing F5 in the browser while the file is loading). In both cases the same error occurs. Maybe this is a feature of the new API's behavior? And how can I keep the new API from responding to a new request before the data transfer from the buffer to the client has finished?
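
For readers unfamiliar with the "via callback" style mentioned in point 1: instead of pushing all chunks in a loop, the next chunk is submitted only once the previous one has been reported as sent. The sketch below models just that control flow. It is not the net module's implementation, the transport call itself is left out, and all names (chunk_sender, send_next_chunk, on_sent) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sender state: one buffer sent in fixed-size chunks. */
struct chunk_sender {
    const char *data;
    size_t len;        /* total bytes to send */
    size_t offset;     /* bytes handed to the transport so far */
    size_t chunk_size; /* maximum bytes per send call */
    size_t sends;      /* number of send calls issued */
};

/* Hand the next chunk to the transport. Returns the number of bytes
 * submitted, or 0 when everything has already been submitted. The actual
 * transport call is omitted; a real sender would issue it here. */
size_t send_next_chunk(struct chunk_sender *s)
{
    assert(s != NULL && s->chunk_size > 0);
    size_t remaining = s->len - s->offset;
    if (remaining == 0)
        return 0;
    size_t n = remaining < s->chunk_size ? remaining : s->chunk_size;
    s->offset += n;
    s->sends++;
    return n;
}

/* "Sent" event handler: runs after the previous chunk was accepted and only
 * then submits the next one. Returns 1 if another chunk was submitted,
 * 0 when the transfer is complete. */
int on_sent(struct chunk_sender *s)
{
    return send_next_chunk(s) > 0;
}
```

In this style only one chunk is outstanding at a time. That keeps memory use bounded, but it is a plausible (unverified) reason why throughput differs from the loop variant.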

Despite these problems, this version is much better than the old API. Thank you for your work!

@LDmitryL

LDmitryL commented Aug 4, 2017

@devsaurus After a connection reset error, normal operation is no longer possible. Receiving data still works, but every attempt to send data fails with a connection reset error. Only a reboot helps, or waiting for the timeout and repeating the request later. From the browser, however, I cannot tell that the server is stuck in this endless error state.

@devsaurus
Member Author

@LDmitryL please open a new issue for your observations and attach your Lua code so I can try to reproduce and investigate.
