Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp32: sockets become unresponsive after several larger transactions #12819

Open
Tobinator32 opened this issue Oct 27, 2023 · 51 comments
Open

esp32: sockets become unresponsive after several larger transactions #12819

Tobinator32 opened this issue Oct 27, 2023 · 51 comments

Comments

@Tobinator32
Copy link

hello,
im encountering massive issues with using microdot or microwebsrv2 with serving ~5 request of ~10kb each on esp32 wroom-32 (no additional memory). Worst case is an immediate unresponsiveness of any new connection. It stops serving the opened connections and doesnt accept any new ones which forces me to do a hard reset to recover functionality. Reducing backlog of async web_server to 2 doesnt really help.
This is issue is consistently replicable with the latest micropython build with idf 5.1 or the latest µP bin from the website.
When using 1.19.1 with idf 4.4.4 it runs for hours serving 100kb per website with two clients connected and requests send every 10s for hours without any issue.
I already made an example for the microdot dev available here: https://github.com/Tobinator32/microdot_crash_demo
Any advices how to get it up and running with the latest µP and IDF would be greatly appreciated.

@robert-hh
Copy link
Contributor

Dies that problem also happen when you used idf 5.0.2? I'm asking because MicroPython uses v5.0.2.

@Tobinator32
Copy link
Author

Tobinator32 commented Oct 27, 2023

will try it later/tomorrow. But using the latest bin from micropy for esp32 from the website yields the same behaviour.

@shariltumin
Copy link

You can try disabling power management

import socket, select, time, network
network.WLAN(network.AP_IF).active(False)
sta = network.WLAN(network.STA_IF)
sta.active(True)
sta.config(pm=sta.PM_NONE)

@shariltumin
Copy link

Doing this after each hard rest seems to give a stable WiFi for me.

import socket, select, time, network
network.WLAN(network.AP_IF).active(False)
sta = network.WLAN(network.STA_IF)
sta.active(True)
sta.config(pm=sta.PM_NONE)
sta.config(txpower=18)

Give it a try.

@Tobinator32
Copy link
Author

Do you have tried it with my demo i provided and can you replicate the issue. Does it go away with your proposed settings?

@shariltumin
Copy link

No, I didn't try your crash demo. I'll leave that to you); What I have noticed is that that with the above mentioned setting I see less of the

...
D (1985209) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd7c8
...

in the debugging log.

@shariltumin
Copy link

There is one more thing you can try. Build your own custom firmware with these HW_ENABLE_MDNS disabled in "mpconfigport.h".

#define MICROPY_HW_ENABLE_MDNS_QUERIES      (0)
#define MICROPY_HW_ENABLE_MDNS_RESPONDER    (0)

I have not seen esp_netif_get_ip_info looping after this. The result is promising.

@Tobinator32
Copy link
Author

what implications result of the esp not printing esp_netif_get_ip_info in the first place?

@shariltumin
Copy link

As I understand it

D (953554) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3fcc222c

The esp_netif process will start executing the esp_netif_get_ip_info function when the TCPIP stack detects that it is losing its IP.

This is good if it works as it should, but it seems that enabling hardware DNS query/responder makes the wifi system slow and unstable.

There must be a good reason why MicroPython has hardware DNS lookup/responder enabled by default, but I cannot answer that.

My test telnet server running on an ESP32 is more stable now that I have disabled the HW_ENABLE_MDNS in the firmware.

@maze1980
Copy link

maze1980 commented Oct 29, 2023

Summary:
I can confirm there are issues with ESP32_GENERIC-20231005-v1.21.0.bin
I can confirm there are no issues with ESP32_GENERIC-20220618-v1.19.1.bin, tested with hundreds of requests
In my opinion the bug is ESP32 specific, and I would suggest to add the corresponding tags (chip and firmware)

Full text:
Sockets become unresponsive after 50..100 subsequent requests (with time.sleep(0.5) between requests). I have a simple web server that will just reply with 'Hello World' and after 75 requests the response time will increases to seconds. If you're unlucky even the first requests will take long, but most often the issues start after around 75 requests. If you then pause a few seconds, it will get better again, but hardly perfect. The web server is single threaded, no parallel threads and it calls gc.collect() before listening to a new request. And I tested with microWebSrv, too.

And of course I tried
sta.config(pm=sta.PM_NONE)
sta.config(txpower=18)
but this didn't change anything.

Board: ESP32 / WROOM
Firmware: https://micropython.org/resources/firmware/ESP32_GENERIC-20231005-v1.21.0.bin
Printed version string: "MicroPython v1.21.0 on 2023-10-05; Generic ESP32 module with ESP32"

Response times as logged per request:
Response time for request 1: 1.035897 seconds Response time for request 2: 0.119372 seconds Response time for request 3: 0.20955 seconds Response time for request 4: 0.556568 seconds Response time for request 5: 0.122106 seconds [...] Response time for request 63: 0.574087 seconds Response time for request 64: 0.12019 seconds Response time for request 65: 0.198902 seconds Response time for request 66: 0.573023 seconds Response time for request 67: 2.265523 seconds Response time for request 68: 0.210223 seconds Response time for request 69: 0.561316 seconds Response time for request 70: 2.254283 seconds Response time for request 71: 4.883501 seconds Response time for request 72: 4.211261 seconds

@Tobinator32
Copy link
Author

Thanks for your contribution. Have you tried increasing the transmitted size to roughly 10kb at least in one direction. This way im having trouble with the first few transactions and it doesnt respond to new sockets immediatly.
Kind regard Tobi

@maze1980
Copy link

No, I didn't run your code, nor did I modify my code to test it with bigger request. Just tested with my setup (that works on the old firmware) with the newest firmware.

@shariltumin
Copy link

I ran some tests where an ESP32 runs a simple http server and MicroPython on a LinuxMint runs a simple http client. The logs were from the client script on the Linux PC.

The first test was done on an ESP32 flashed with a firmware build using ESP-IDFv5.0.2. At some point during the test, the ESP32 became unresponsive for 63361ms.

The second test was on the same ESP32 flashed with firmware based on ESP-IDFv4.4.5. The Wifi/TCPIP stack behaves better with the IDFv4.4.5 based firmware. There was a time when we had to wait 342ms.

I think this is a problem. We need a stable wifi running on the ESP32. The ESP-IDFv4xx is no longer supported for future MicroPython versions. The v1.20.0-206-g33b403dfb was the copy I had that was based on ESP-IDFv4xx.

The ESP-IDFv5.0.2 based firmware has broken WiFi. Different build settings and different wlan.config settings did not help much (or at all).

The problems may be caused by ESP-IDFv5xx and not necessarily by the MicroPython core.

  1. ESP-IDFv5.0.2 MicroPython v1.22.0-preview.63.g2fda94c28 ESP32
...
Hello World! 15ms Total: 10962/51.55 min: 14ms max: 30451ms avg: 212.6479146459748Bytes/s
Hello World! 31ms Total: 10976/51.581 min: 14ms max: 30451ms avg: 212.7915317655726Bytes/s
Hello World! 63361ms Total: 10990/114.942 min: 14ms max: 63361ms avg: 95.61343982182319Bytes/s
Hello World! 25ms Total: 11004/114.967 min: 14ms max: 63361ms avg: 95.71442239947115Bytes/s
Hello World! 100ms Total: 11018/115.067 min: 14ms max: 63361ms avg: 95.75290917465477Bytes/s
Hello World! 23ms Total: 11032/115.09 min: 14ms max: 63361ms avg: 95.85541749934833Bytes/s
...
  1. ESP-IDFv4.4.5 MicroPython v1.20.0-206-g33b403dfb ESP32
...
Hello World! 32ms Total: 14448/28.88999999999987 min: 15ms max: 316ms avg: 500.1038421599192Bytes/s
Hello World! 25ms Total: 14462/28.91499999999987 min: 15ms max: 316ms avg: 500.1556285664903Bytes/s
Hello World! 342ms Total: 14476/29.25699999999987 min: 15ms max: 342ms avg: 494.7875722049446Bytes/s
Hello World! 27ms Total: 14490/29.28399999999987 min: 15ms max: 342ms avg: 494.8094522606224Bytes/s
Hello World! 24ms Total: 14504/29.30799999999987 min: 15ms max: 342ms avg: 494.8819434966584Bytes/s
...

@dpgeorge
Copy link
Member

Thanks for the report. I'm able to reproduce the problem using the microdot example in the original post here. With v1.20.0 responses are returned regularly, within 1 second for the vast majority of requests. With current master there are responses that take a long time, up to 100 seconds.

@dpgeorge
Copy link
Member

@Tobinator32 can you please try the firmware that is broken, and at the start of each HTTP request print out the available memory, like this:

     import micropython, esp32
     micropython.mem_info()
     print(esp32.idf_heap_info(1), esp32.idf_heap_info(4))

From my tests it seems that the IDF is slowly running out of heap memory, and when there's none left it just blocks the TCP traffic.

@Tobinator32
Copy link
Author

Thanks for the Answer. Yes i can try it in the evening or tomorrow.

@Tobinator32
Copy link
Author

Tobinator32 commented Nov 3, 2023

crash_log.txt
Tried it out with your additional commands. See the console output. As expected it crash soon after requesting some pictures. Feel free to tell me your thoughts about it. Maybe its worth to mention that even the already established websocket connection stops receiving messages. So its not all about the overhead introduced by opening and closing TCP sockets

@maxi07
Copy link

maxi07 commented Nov 5, 2023

Actually I am experiencing a similar issue with the Raspberry Pico W with Micropython and Microdot. At some point, the web sockets become unresponsive with hard reset being the only solution.

@shariltumin
Copy link

I get good results with this setting in the "boards/ESP32_GENERIC/sdkconfig.debug" file:

# Enable debug
CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y

# LWIP TCP setting
CONFIG_LWIP_TCP_MSL=6000
CONFIG_LWIP_TCP_FIN_WAIT_TIMEOUT=2000

The firmware was compiled with the board variant

make V=1 BOARD=ESP32_GENERIC BOARD_VARIANT=DEBUG

Perhaps others can take the time to try and varify for themselves.

@dpgeorge dpgeorge changed the title sockets become unresponsive after several larger transactions esp32: sockets become unresponsive after several larger transactions Nov 6, 2023
@dpgeorge
Copy link
Member

dpgeorge commented Nov 6, 2023

Actually I am experiencing a similar issue with the Raspberry Pico W with Micropython and Microdot. At some point, the web sockets become unresponsive with hard reset being the only solution.

@maxi07 the issue here was first reported on esp32, and likely the cause is specific to that platform.

For the problem that you see on Pico W, can you please open a separate issue and describe the bug there, and also provide code that can be run to reproduce your issue. Then it can be investigated.

@projectgus
Copy link
Contributor

projectgus commented Nov 7, 2023

EDIT: Sorry, some of this is wrong but unclear which part. Will debug some more and post again.

@projectgus
Copy link
Contributor

Using @dpgeorge's reproduction code (esp32 port server, unix port client), I can confirm at least one problem is related to port reuse in the client and the TIME_WAIT state on the server:

Recall that each IPv4 TCP connection is uniquely identified by a 4-tuple: source IP address, destination IP address, source port, destination port.

  • A client that repeatedly connects to a server should use a new source port each time (the rest of the tuple will be the same.)
  • Each time the server closes a connection, it is required to keep the closed socket in the TIME_WAIT state for (2 * TCP MSL). The default MSL on ESP32 is 60 seconds, so this is 2 minutes.
  • If a client tries to open a new connection reusing a previously used source port while the closed socket is in TIME_WAIT, then I see the server ACKing it using the old TIME_WAIT socket's sequence number. The client should see this is incorrect and send back an RST to close the connection. In my tests, the server also ACKs this RST packet and the server and client get stuck here (client sending RSTs, server ACKing them based on the old socket) until the TIME_WAIT socket eventually times out. The client may time out before this happens.

On MicroPython V1.20 this doesn't seem to happen. I've seen duplicate ports within 30 seconds of each other, and the server accepts them. In theory LWIP is configured with the same MSL period for TIME_WAIT. I think this could one of multiple things:

  • A bug in the older ESP-IDF lwIP version, meaning TIME_WAIT sockets are timing out sooner than they should.
  • V1.20 is running into a resource limit. Under some circumstances lwIP will clean up TIME_WAIT sockets early to save resources, and maybe this is happening continuously.
  • A bug in the newer LWIP version. For example, perhaps it should be sending valid ACKs to restore the TIME_WAIT socket, as allowed for in RFC1122 and the invalid ACK is the bug.

Workarounds

  • Setting CONFIG_LWIP_TCP_MSL to a much lower value, as described by @shariltumin two posts above, should prevent this (or, at worst, shorten the delay if it happens). This has the side effect of potentially leading to other TCP issues on unreliable or slow networks, but setting a value lower than the default 60000 and higher than (say) 5000 could be a good compromise.
  • Closing the connection from the client not the server will also prevent this. Only the end of the connection that initiates closing can end up in TIME_WAIT. If the client initiates closing then it will end up in TIME_WAIT, and the server can immediately forget about the old connection.
  • Refactor the application to use long-lived TCP connections. I know this is easier said than done, but I thought I'd mention it.

@Tobinator32 it's not 100% clear that this is even the root cause for the issue you initially reported. Are you able to test one of these workarounds (suggest the first one of MSL) and see if the problem goes away?

@Tobinator32
Copy link
Author

do you mean the suggestion from @shariltumin only? The thing is my webserver demo doesnt have the exact same issue as it starts freezing in the middle of a file transfer not after establishing new connection. Yes it refuses new connections but only after the file transfer has crashed. Im slowly thinking that the described issues are a part of the problem im seeing but not the whole picture. Maybe you can try the demo i uploaded on github. Just use the latest micropython binary and upload the few files.

@projectgus
Copy link
Contributor

do you mean the suggestion from @shariltumin only? The thing is my webserver demo doesnt have the exact same issue as it starts freezing in the middle of a file transfer

I did mean that suggestion (sorry for the vague wording), but I had a feeling this might be a different issue. I'll try your reproduction code soon.

@projectgus
Copy link
Contributor

projectgus commented Nov 15, 2023

Hi @Tobinator32,

Thanks for explaining again. I was able to reproduce most of what you describe with your demo project. In my case, although I saw some transfers fail (hang part way) a subsequent transfer would always succeed if retried after a short time (but it might take a minute or so to recover in the worst case.)

My reproduction strategy was very simple, running two concurrent terminals with:

while wget --tries=1 --timeout=10 -O/dev/null http://192.168.4.1/new/images/200_1.jpg; do; done 

and

while wget --tries=1 --timeout=10 -O/dev/null http://192.168.4.1/new/images/200.jpg; do; done 

... I think with more concurrent transfers it'd fail sooner, but I could reproduce within a minute or so with just two.

There are two different failure modes:

  1. The ESP32 is intermittently running out of memory in the ESP-IDF heap. There isn't a lot of free memory to begin with, with .py files MicroPython is taking 128KB for its heap by the time the app is running (with .mpy it stayed within 64KB). When IDF heap runs out, the current TCP connections stall (including mid-transfer). If the HTTP client retries after a minute or so, everything works again until the server runs out of heap again.
  2. The problem mentioned above with local socket reuse can cause an HTTP request to fail to open the connection if the old socket is still lurking in TIME_WAIT. The socket will keep retrying until it times out. Once the client tries again with a new local socket port then this one should work immediately, though.

Mostly I think the problems you're seeing are (1) not (2), but I saw both.

Sockets in TIME_WAIT seem to be the main culprit. I hadn't realised until today, LWIP doesn't place any hard limit on the number of these in the system - if you make requests fast enough then they'll keep growing and taking up RAM until they time out. In my tests I could see over 250 (the structure is 208 bytes plus heap overhead, so this would be over 50KB of RAM!)

Until there's a patch for that, the best workaround is to rebuild MicroPython with CONFIG_LWIP_TCP_MSL=5000, using the method described by shariltumin, so sockets in TIME_WAIT are cleaned up much sooner (10 seconds vs 2 minutes).

@Tobinator32 are you able to please give that a try, and see if the problem still appears for you?

@Tobinator32
Copy link
Author

Hello @projectgus,
First of all you are correct with your assumptions. The match pretty much with my observations. What i miss somehow is that a transfer of lets say 5 pictures simultanously (The demosite has pages with more pictures though). In this case it fails mid-transfer for most of them and one maybe finishes. In this case no timeouts (your second bullet point) occur but the transfer is stuck and the esp32 doesnt accept any new transfers. With IDF4.x and Micropython 1.91.1 this runs fine for hours with a 10s automatic page reload. There has to be some other mechanism as well...
Kind Regard Tobi

@projectgus
Copy link
Contributor

In this case it fails mid-transfer for most of them and one maybe finishes. In this case no timeouts (your second bullet point) occur but the transfer is stuck and the esp32 doesnt accept any new transfers.

When the free IDF heap runs out, this is what I see as well (point 1). Transfers stop mid-way, packet capture shows nothing happening at all. In my testing I found that it will accept a new connection after some time (sometimes minutes) but it otherwise looks like it's completely stuck (and maybe some of the cases where it runs out of RAM it will break worse than this).

Are you able to try and reproduce with CONFIG_LWIP_TCP_MSL=5000 and see what you observe?

@shariltumin
Copy link

Tests were performed on ESP-WROOM-32 boards (no PSRAM). CPU speed was 24Mhz.
Use only socket.read() and socket.write(), no recv or send were used.

Test1

Server with official MicroPython

MPY: soft reboot
MicroPython v1.22.0-preview.67.g64c79a542 on 2023-10-30; Generic ESP32 module with ESP32
Type "help()" for more information.
>>> import http_rw_srv
('192.168.4.91', '255.255.252.0', '192.168.4.1', '192.168.4.1')
Service: ('192.168.4.27', 42700)
req: b'GET / HTTP/1.1\r\nHost: 192.168.4.85\r\n\r\n'
Number of bytes sent: 10445
...

Unix MicroPython (upy) as the client

$ time upy http_rw_cln.py
------------------------------------
Connected after 11ms min:11ms max:11ms avg:11.000ms timeouts:0
PKT# 1 Bytes read: 10445 w-err: 0 r-err: 0
EXEC:[(15072, 4, 0, 4), (113840, 90280, 86016, 88288), (4224, 3820, 3584, 3804)]
DATA:[(240, 4, 0, 4), (7288, 4, 0, 4), (16648, 4, 0, 4), (87144, 4, 0, 4), (15072, 4, 0, 4), (113840, 90280, 86016, 88288)]
 HelloWorld@154464990000 HelloWorld
rld@154464990000 HelloWorld@154464990000 HelloWorld@154464990000 HelloWorld@1544
 115ms min:115 max:115 avg:115.000, Rate:90826.087Bytes/s Total:10445/0.115
...
...
Connected after 3ms min:2ms max:316ms avg:5.283ms timeouts:358
PKT# 5000 Bytes read: 10441 w-err: 0 r-err: 155
EXEC:[(15072, 4, 0, 4), (113840, 21308, 8192, 4340), (4224, 3820, 3584, 3804)]
DATA:[(240, 4, 0, 4), (7288, 4, 0, 4), (16648, 4, 0, 4), (87144, 4, 0, 4), (15072, 4, 0, 4), (113840, 21308, 8192, 4340)]
 HelloWorld@1965395438000 HelloWorld@19
d@1965395438000 HelloWorld@1965395438000 HelloWorld@1965395438000 HelloWorld@196
 144ms min:30 max:1234 avg:150.731, Rate:69271.132Bytes/s Total:52206466/753.654

real	30m11,008s
user	0m5,678s
sys	0m4,460s

Test2

Server with custom compiled with debugging log enabled and some LWIP config.

MPY: soft reboot
MicroPython v1.22.0-preview.63.g2fda94c28.dirty on 2023-11-15; Generic ESP32 module with ESP32-DEBUG
Type "help()" for more information.
>>> import http_rw_srv
(Some debugging log output)
('192.168.4.92', '255.255.252.0', '192.168.4.1', '192.168.4.1')
Service: ('192.168.4.27', 58676)
req: b'GET / HTTP/1.1\r\nHost: 192.168.4.85\r\n\r\n'
Number of bytes sent: 10442
...

Unix MicroPython (upy) as the client

$ time upy http_rw_cln.py
------------------------------------
Connected after 16ms min:16ms max:16ms avg:16.000ms timeouts:0
PKT# 1 Bytes read: 10442 w-err: 0 r-err: 0
EXEC:[(15072, 4, 0, 4), (113840, 90352, 86016, 58868), (1260, 928, 896, 912)]
DATA:[(240, 4, 0, 4), (7288, 4, 0, 4), (16648, 4, 0, 4), (87424, 4, 0, 4), (15072, 4, 0, 4), (113840, 90352, 86016, 58868)]
 HelloWorld@5756550427000 HelloWorld@5
d@5756550427000 HelloWorld@5756550427000 HelloWorld@5756550427000 HelloWorld@575
 86ms min:86 max:86 avg:86.000, Rate:121418.605Bytes/s Total:10442/0.086
...
...
Connected after 12ms min:5ms max:75ms avg:11.284ms timeouts:6
PKT# 5000 Bytes read: 10442 w-err: 0 r-err: 1
EXEC:[(15072, 4, 0, 4), (113840, 74444, 69632, 55916), (1260, 928, 896, 912)]
DATA:[(240, 4, 0, 4), (7288, 4, 0, 4), (16648, 4, 0, 4), (87424, 4, 0, 4), (15072, 4, 0, 4), (113840, 74444, 69632, 55916)]
 HelloWorld@6571574805000 HelloWorld@6
d@6571574805000 HelloWorld@6571574805000 HelloWorld@6571574805000 HelloWorld@657
 154ms min:86 max:300 avg:159.629, Rate:65414.314Bytes/s Total:52210108/798.145

real	13m35,101s
user	0m5,601s
sys	0m4,466s

Summing up

Test1 took 30m11s to complete for 5000 packets
Connection: min:2ms max:316ms avg:5.283ms. Timeout errors: 358
Read/Write error: w-err: 0 r-err: 155
Packet throughput: min:30ms max:1234ms avg:150.731ms
Packet Total:52206466/753.654 at Rate:69271.132Bytes/s
Test Rate: 52206466/1811 = 28827.425Bytes/s

Test2 took 13m35s to complete for 5000 packets
Connection: min:5ms max:75ms avg:11.284ms Timeout errors: 6
Read/Write error: w-err: 0 r-err: 1
Packet throughput: min:86ms max:300ms avg:159.629ms
Packet Total:52210108/798.145 at Rate:65414.314Bytes/s
Test Rate: 52210108/815 = 64061.482Bytes/s

Timeout occurs when a connect request from the client is not answered after 1 second. The client will then sleep for 1 second. Each timeout contributes to 2 seconds of waiting on the on the client side.

We have an w-err when an error occurs during the socket.write() operation on the client's HTTP request. An r-err occurs during socket.read() while the client is receiving the response from the server.

Test1 reported 358 timeouts and 155 r-err. When these happen, the server reports these errors

Service: ('192.168.4.27', 54966)
req: b'GET / HTTP/1.1\r\nHost: 192.168.4.85\r\n\r\n'
Send Error: [Errno 113] ECONNABORTED
Accept Error: [Errno 128] ENOTCONN
Sleep 3 sec

The connection from the server to the client has been broken (ECONNABORTED) because the server has been disconnected from the WiFi (ENOTCONN). The server will eventually reconnect and continue to listen for a new connection.

If it works, Test1 gives a better throughput, 69271.132 Bytes/s compared to Test2, 65414.314 Bytes/s.

The Test2 runs custom firmware. All are the same as ESP32_GENERIC except for the board variant defined by boards/ESP32_GENERIC/sdkconfig.debug.

# enable debug
CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y

# LWIP setting
CONFIG_LWIP_TCPIP_TASK_PRIO=23
CONFIG_LWIP_SO_LINGER=y

CONFIG_LWIP_PPP_ENABLE_IPV6=n

CONFIG_LWIP_TCP_TMR_INTERVAL=100
CONFIG_LWIP_TCPIP_TASK_AFFINITY_CPU0=y

CONFIG_LWIP_TCP_MSL=6000
CONFIG_LWIP_TCP_FIN_WAIT_TIMEOUT=2000

These values are experimental. Some of them may even be "wrongly" configured.

Why?

I have a personal interest in a stable and fast Wifi and TCP/IP socket
implementation in MicroPython, as I am developing custom firmwares with
camera support and need to stream image frames over HTTP as fast as possible.

@maze1980
Copy link

I'd say these should be added to the default:
CONFIG_LWIP_SO_LINGER=y
CONFIG_LWIP_TCP_FIN_WAIT_TIMEOUT=2000
CONFIG_LWIP_TCP_MSL=6000

But
CONFIG_LWIP_TCP_TMR_INTERVAL=100
will cause more re-transmits as the re-transmit timer expires faster. Don't change it from the default value unless you change it on both sides.

And if you disable debugging, things should go faster.

@projectgus
Copy link
Contributor

will cause more re-transmits as the re-transmit timer expires faster. Don't change it from the default value unless you change it on both sides.

Yes. It's easy to tune TCP stacks for settings that work well in one context but make you a very bad network peer in other contexts. The RFC-specified TCP settings have their flaws: they mostly assume 1970s/80s ARPAnet/early-internet network configurations and mostly networks don't look like that any more. However, there are still places that more conservative settings end up being needed (for example, cellular links can easily see huge latency spikes.)

Test1 took 30m11s to complete for 5000 packets
Test2 took 13m35s to complete for 5000 packets

@shariltumin Are you able to share your test code please, or to take a complete packet capture of your test runs in Wireshark? That can help confirm if these are related problems or different problems (I suspect the same problems, but hard to know.)

These values are experimental. Some of them may even be "wrongly" configured.

I'm curious if you change only CONFIG_LWIP_TCP_MSL=6000 and re-run the test, what do you see then?

@shariltumin
Copy link

@projectgus, you can find the relevant files here

@shariltumin
Copy link

Good news!
These are the results of running tests using only

 CONFIG_LWIP_TCP_MSL=6000

and nothing else changed for ESP32_GENERIC

>>> 
MPY: soft reboot
MicroPython v1.22.0-preview.164.gfce8d9fd5.dirty on 2023-11-21; Generic ESP32 module with ESP32-GUS
Type "help()" for more information.
>>>

Test1:

Connected after 4ms min:3ms max:173ms avg:8.792ms timeouts:0
PKT# 5000 Bytes read: 10442 w-err: 0 r-err: 0
EXEC:[(15072, 4, 0, 4), (113840, 73632, 69632, 55700), (1176, 844, 832, 828)]
DATA:[(240, 4, 0, 4), (7288, 4, 0, 4), (16648, 4, 0, 4), (87384, 4, 0, 4), (15072, 4, 0, 4), (113840, 73632, 69632, 55700)]
 HelloWorld@854534054000 HelloWorld@85
rld@854534054000 HelloWorld@854534054000 HelloWorld@854534054000 HelloWorld@8545
 149ms min:104 max:347 avg:155.963, Rate:66951.962Bytes/s Total:52210144/779.815

real	13m1,744s
user	0m5,580s
sys	0m4,486s 

Test2:

Connected after 11ms min:3ms max:453ms avg:10.854ms timeouts:0
PKT# 5000 Bytes read: 10442 w-err: 0 r-err: 1
EXEC:[(15072, 4, 0, 4), (113840, 74272, 69632, 55244), (1176, 844, 832, 828)]
DATA:[(240, 4, 0, 4), (7288, 4, 0, 4), (16648, 4, 0, 4), (87384, 4, 0, 4), (15072, 4, 0, 4), (113840, 74272, 69632, 55244)]
 HelloWorld@2044193449000 HelloWorld@2
d@2044193449000 HelloWorld@2044193449000 HelloWorld@2044193449000 HelloWorld@204
 142ms min:107 max:1140 avg:159.540, Rate:65450.522Bytes/s Total:52210012/797.702

real	13m21,742s
user	0m5,624s
sys	0m4,495s

Test1 was done with no other traffic on my local home network, while Test2 was done with a PC streaming a 4K YouTube video.

Note the timeouts, w-err and r-err. It seems that the CONFIG_LWIP_TCP_MSL=6000 has solved the problem.

Even after a soft reset, I see these debugin messages

>>> 
MPY: soft reboot
MicroPython v1.22.0-preview.164.gfce8d9fd5.dirty on 2023-11-21; Generic ESP32 module with ESP32-GUS
Type "help()" for more information.
>>> D (3236179) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3237199) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3238579) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3239589) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3240979) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3241989) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3243379) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3244379) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
D (3245789) esp_netif_lwip: esp_netif_get_ip_info esp_netif:0x3ffbd66c
...

Is there a way to reduce the rate of calling the esp_netif_get_ip_info()?

However,

network.WLAN(network.STA_IF).active(False)

statement will disable the station interface and silence the debug log.

@projectgus
Copy link
Contributor

Note the timeouts, w-err and r-err. It seems that the CONFIG_LWIP_TCP_MSL=6000 has solved the problem.

That's great, thanks @shariltumin!

@Tobinator32 are you able to try the same test, please? If it's helpful then I can upload a MicroPython build with this config somewhere for you to download.

@projectgus
Copy link
Contributor

Thanks for the great test project, btw. Much appreciated.

Is there a way to reduce the rate of calling the esp_netif_get_ip_info()?

I don't know what is calling this, I think possibly it's coming from wifi library event being processed inside ESP-IDF. "Debug" level logs tend to be a little spammy, unfortunately.

Note that once #12900 is merged it'll be straightforward to turn these on and off from Python again. (I'm not sure if that will work with the firmware you're using now.)

@shariltumin
Copy link

The frequency of the esp_netif_get_ip_info() calls can be reduced by changing the value of MDNS_QUERY_TIMEOUT_MS in "modsocket.c".

I got a good result by changing the value from 5000 to 50000.

$ diff modsocket.c modsocket.c-ORIG
59c59
< #define MDNS_QUERY_TIMEOUT_MS (50000)
---
> #define MDNS_QUERY_TIMEOUT_MS (5000)

@Tobinator32
Copy link
Author

Tobinator32 commented Nov 23, 2023

maybe i finally loop in. @projectgus yes please upload the binary somewhere so that its easier for me to test. I guess i ran my initial webserver test with http cat and see if it gets stuck or crashes. Can do this around the weekend.
And of course a huge shout out to all of you. I see really awesome work and issue tracking capabilities from you guys. Good Work.
Kind Regards Tobi

projectgus added a commit to projectgus/micropython that referenced this issue Nov 27, 2023
Default MSL is 60 seconds.

The main purpose of this is to shorten time closed sockets spend in the
TIME_WAIT state. This is (2 * MSL) so previously 2 minutes, with this
change becomes 20 seconds.

Without this, a system which makes many short-lived TCP connections can
grow to be holding hundreds of sockets in TIME_WAIT.

Possible fix for micropython#12819

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
@projectgus
Copy link
Contributor

@Tobinator32 Great, thank you. There are two ESP32_GENERIC firmwares in this zip file:

issue12819_esp32_generic_firmwares.zip

Please try flashing ESP32_GENERIC_MSL_10s.bin first (can flash using the same method as for an official firmware from the download page). If this works in your tests, great!

If "10s" version doesn't work, please test again with ESP32_GENERIC_MSL_6s.bin. It's only necessary to try the "6s" version if the "10s" version doesn't work.

The firmware is built from this branch:
https://github.com/projectgus/micropython/tree/bugfix/esp32_shorter_msl

@Tobinator32
Copy link
Author

Hello together,
sorry for the long delay but im very busy until the end of the year but managed to find some spare time to test the firmware. Unfortunately both firmwares @projectgus provided fail in the same way as before. It just gets stuck in the middle of a transfer and then doesnt do anything and doesnt recover either. Maybe you can try out my demo real quick by yourself but i had no success doing so. Any further suggestion how to fix the issue or something i can try to help?
Kind Regards Tobi

@dcox761
Copy link

dcox761 commented Mar 4, 2024

Thank you for posting this thread, it was very helpful to know a prior version 1.19 that works well.

I am using ESP32-WROOM-32D DevKitC for a small project that I need to update remotely. Due to WebREPL instability, I setup an API using Microdot to upload changed .MPY files (up to 15 ranging from 400-8500 bytes) and restart.

  • All the dependencies frozen to reduce memory
  • Regular GC (including before and after each API request), leaving approx. 65k free

With 1.22, it regularly stopped responding, sometimes after only 3 or 4 requests. Connected to USB, I could see these exceptions for OSError ECONNABORTED.

Traceback (most recent call last):
  File "asyncio/core.py", line 1, in run_until_complete
  File "aioprof.py", line 93, in send
  File "microdot.py", line 1183, in serve
  File "microdot.py", line 1295, in handle_request
  File "microdot.py", line 642, in write
  File "asyncio/stream.py", line 1, in stream_awrite
  File "asyncio/stream.py", line 1, in drain
OSError: [Errno 113] ECONNABORTED
Traceback (most recent call last):
  File "microdot.py", line 1289, in handle_request
  File "microdot.py", line 421, in create
  File "asyncio/stream.py", line 1, in readexactly
OSError: [Errno 113] ECONNABORTED
Traceback (most recent call last):
  File "asyncio/core.py", line 1, in run_until_complete
  File "aioprof.py", line 93, in send
  File "microdot.py", line 1183, in serve
  File "microdot.py", line 1295, in handle_request
  File "microdot.py", line 642, in write
  File "asyncio/stream.py", line 1, in stream_awrite
  File "asyncio/stream.py", line 1, in write
OSError: [Errno 113] ECONNABORTED

Most of the time the client would never respond, but sometimes would show:

ConnectionResetError: [Errno 54] Connection reset by peer

Additional requests could still be made from a different client. USB console shows the request received which was enough to remotely reset but responses rarely worked.

I have tried this workaround CONFIG_LWIP_TCP_MSL=6000 from above without any luck.

I was looking into options to try and catch the exceptions properly and restart microdot or the Wifi, but changing to 1.19 (IDF 4.4.4) has fixed this completely even though there is less memory available (only 20k compared to 65k). A load test that repeatedly uploaded all files, ran for over 12 hours with an occasional MemoryError.

I have a spare board to experiment with and a repeatable test, so would be happy to try any other ideas and report back.

The issues with WebREPL could be related, but not solved by changing to older version.

Best regards, Daniel.

@DeADeGG1
Copy link

DeADeGG1 commented Mar 8, 2024

@Tobinator32 Great, thank you. There are two ESP32_GENERIC firmwares in this zip file:

issue12819_esp32_generic_firmwares.zip

Please try flashing ESP32_GENERIC_MSL_10s.bin first (can flash using the same method as for an official firmware from the download page). If this works in your tests, great!

If "10s" version doesn't work, please test again with ESP32_GENERIC_MSL_6s.bin. It's only necessary to try the "6s" version if the "10s" version doesn't work.

The firmware is built from this branch: https://github.com/projectgus/micropython/tree/bugfix/esp32_shorter_msl

Hi! Thank you so much for precompiled firmwares! In my case ESP32_GENERIC_MSL_10s.bin works great!! Non of the official ones works as expected. I have tried to compile your branch on my own, also tried official master branch with CONFIG_LWIP_TCP_MSL=10000, but works like in official... Can you please tell how to compile that firmware? I have tried both IDFv5.0.2 and IDF5.1.2... I see that in the name there is 'with-newlib4.1.0', maybe the way you compile is the key for unresponsive sockets? Thanx!

@dcox761
Copy link

dcox761 commented Mar 31, 2024

I was encouraged by the most recent comment about ESP32_GENERIC_MSL_10s.bin and just downloaded and tested it on ESP32-WROOM-32D DevKitC.

Unfortunately I still see the same ECONNABORTED error as reported earlier.

This is after running for only 1 min and uploading the same 16kb file just 3 times, with the following 2 exceptions seen.

Traceback (most recent call last):
  File "/lib/microdot.py", line 1330, in handle_request
  File "/lib/microdot.py", line 420, in create
  File "asyncio/stream.py", line 1, in readexactly
OSError: [Errno 113] ECONNABORTED
Traceback (most recent call last):
  File "asyncio/core.py", line 1, in run_until_complete
  File "/lib/microdot.py", line 1224, in serve
  File "/lib/microdot.py", line 1336, in handle_request
  File "/lib/microdot.py", line 641, in write
  File "asyncio/stream.py", line 1, in stream_awrite
  File "asyncio/stream.py", line 1, in write
OSError: [Errno 113] ECONNABORTED

1.19 is still working perfectly.

Regards,
Daniel.

@tomasmez
Copy link

Hello,

I was suffering from the same bug. tested the 10s version and it improves a lot.
in my application memory is not such an issue so 10s is working. if i spam the server then it starts failing but recovery times are acceptable.

im adding myself to this issue for a follow up.

Thanks

@tomasmez
Copy link

Hello,

How did you manage to get 1.19 running? i tried installing it but asyncio is not present there and needed by microdot.

Got the binary from here: https://micropython.org/download/ESP32_GENERIC/

this is beacuse finally 10S is aborting connections again with minimal application code change.

thanks

@Tobinator32
Copy link
Author

Hi,
Was that question regarding how i obtained a 1.19 version and where the binaries are to be found?
Kind Regards Tobi

@tomasmez
Copy link

tomasmez commented Apr 18, 2024

@Tobinator32 Actually i got 1.19 from the micropython download page.
but microdot does not work in 1.19. asyncio module is not present.

@Tobinator32
Copy link
Author

the only way i had success was build 1.19 with idf 4.4.4 myself. This way its rock solid. asyncio was not an issue in my case and was running perfectly. Although stated i havent used to "official" binaries

@tomasmez
Copy link

can you share your binary? none of the options here fix the issues for me in the end.

@Tobinator32
Copy link
Author

I can look into that tomorrow. I have to build the image without some customer specific sources.

@dcox761
Copy link

dcox761 commented Apr 19, 2024

can you share your binary? none of the options here fix the issues for me in the end.

To save memory I also compile Micropython 1.19 with a manifest file to freeze all of my dependencies. My application has been stable on this version. I have been trying newer versions and combinations without any luck, as shown above.

Microdot works reliably on the downloaded ESP32_GENERIC.bin if you copy the uasyncio files from Micropython source.

git clone https://github.com/micropython/micropython
cd micropython
git checkout v1.19

Copy the whole uasyncio subdirectory (from extmod/) to your ESP32 filesystem.

@tomasmez
Copy link

can you share your binary? none of the options here fix the issues for me in the end.

To save memory I also compile Micropython 1.19 with a manifest file to freeze all of my dependencies. My application has been stable on this version. I have been trying newer versions and combinations without any luck, as shown above.

Microdot works reliably on the downloaded ESP32_GENERIC.bin if you copy the uasyncio files from Micropython source.

git clone https://github.com/micropython/micropython
cd micropython
git checkout v1.19

Copy the whole uasyncio subdirectory (from extmod/) to your ESP32 filesystem.

Thanks for this info. Tried and the error is that effectively it is running out of ram.
I created mpy files for microdot to save heap and the application works now.
tested on 1.22.2 with microdot built with mpy-cross.

my application is nothing memory intensive so i ask: what could i do to preserve more ram?

@tomasmez
Copy link

tomasmez commented Apr 20, 2024

i did not realize i was running microdot with a lot of extensions by default.
removed them and now it works.

Doing some more inspecting. it appears the garbage collector is not freeing the ram fast enough.
forced a gc.collect() within my main loop (1/sec) and the app runs stable.

Used ram:100064, Free ram:33040
Used ram:103984, Free ram:29120
Used ram:107904, Free ram:25200
Used ram:111824, Free ram:21280
Used ram:115744, Free ram:17360
Used ram:119664, Free ram:13440
Used ram:123584, Free ram:9520
Used ram:127504, Free ram:5600
Used ram:131424, Free ram:1680

right before microdot fails. each line is 1 sec appart.

in contrast. forcing a collect:

Used ram:45776, Free ram:107184
Used ram:45680, Free ram:107280
Used ram:45744, Free ram:107216
Used ram:45680, Free ram:107280
Used ram:45744, Free ram:107216
Used ram:45680, Free ram:107280
Used ram:45744, Free ram:107216
Used ram:45680, Free ram:107280
Used ram:45744, Free ram:107216

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants