esp32: Heap fragmentation and mbedtls/lwip. #8940

Open
jimmo opened this issue Jul 21, 2022 · 21 comments

@jimmo
Member

jimmo commented Jul 21, 2022

This is a meta-issue to track esp32 heap fragmentation issues and analysis of possible solutions and workarounds.

(See original analysis in #5543 and other reports/related work in #7038, #8628, #8662, #8251, #5355, #7061, #5219, #5808, #7214)

The high-level issue is that on, for example, a pico-d4 with IDF v4.4 the RAM layout at startup is:

Showing data for heap: 0x3ffb9a20 
Block 0x3ffbb4e8 data, size: 4452 bytes, Free: Yes 
Showing data for heap: 0x3ffc9ba8 
Block 0x3ffcf174 data, size: 69256 bytes, Free: Yes 
Showing data for heap: 0x3ffe0440 
Block 0x3ffe0a9c data, size: 13440 bytes, Free: Yes 
Showing data for heap: 0x3ffe4350 
Block 0x3ffe49ac data, size: 112208 bytes, Free: Yes 

MicroPython's current logic is to calculate the total (8-bit capable) IDF heap size (244072) and the largest contiguous free block (110592), then try to allocate min(244072 / 2, 110592) = 110592 bytes for its own heap. (get_largest_free_block returns 110592 even though that block is actually 112208 bytes, i.e. the value is rounded down to a multiple of 4kiB. Looking at the implementation of the IDF allocator, it does power-of-two allocations.)
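As a rough sketch of that sizing rule, using only the numbers quoted above: the helper names below are illustrative and are not MicroPython's or the IDF's actual functions, and the 4 KiB rounding is simply what the reported value suggests.

```c
#include <stddef.h>

/* Illustrative helper: round a size down to a multiple of 4 KiB,
 * which matches the reported 112208 -> 110592 behaviour. */
size_t round_down_4k(size_t n) {
    return n & ~(size_t)4095;
}

/* Illustrative helper: the MicroPython heap gets the smaller of
 * half the total IDF heap and the largest free block. */
size_t mp_heap_size(size_t total_heap, size_t largest_free) {
    size_t half = total_heap / 2;
    return half < largest_free ? half : largest_free;
}
```

With the pico-d4 figures, `mp_heap_size(244072, round_down_4k(112208))` gives 110592, so the largest block is the limiting term rather than the "half of total" term.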

This leaves four fragments for the IDF heap: 4452 + 69256 + 13440 + 1612 bytes.

From this, the IDF allocates for:

  • mbedtls (16kiB + 4kiB per wrapped socket for buffers, plus a further ~15kiB at peak)
  • lwip (~4-7kiB per socket)
  • wifi stack (28kiB for initialisation, 2.5kiB for active, 7kiB for connect = 45.5kiB total) (further 3.5kiB if you also enable softAP)
  • bluetooth stack (19kiB for ble.active(1), plus small extra for NimBLE service registrations).

For example, consider fetching an HTTP resource. After wifi is connected, the largest available IDF heap alloc is 25600. This drops to 15616 while the socket is open (9.75kiB total) and returns to 25600 after close.

When fetching an HTTPS resource, at the point wrap_socket is called, the available IDF blocks are (22528, 13312, 1600) (total 37kiB). There's a further ~7kiB of lwip mallocs, plus the 35kiB of mbedtls. The request subsequently fails due to an OOM.
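The budget arithmetic behind that failure can be sketched as follows. This is a simplified check on totals only, with hypothetical helper names; it ignores allocator overhead and fragmentation, both of which make the real situation worse.

```c
#include <stddef.h>

/* Sum the free IDF blocks reported at a given point in time. */
size_t total_free(const size_t *blocks, size_t n) {
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += blocks[i];
    }
    return sum;
}

/* Bytes by which the demand exceeds what is free (0 if it fits). */
size_t shortfall(size_t free_bytes, size_t needed_bytes) {
    return needed_bytes > free_bytes ? needed_bytes - free_bytes : 0;
}
```

For the blocks (22528, 13312, 1600) the free total is 37440 bytes, while ~7kiB of lwip plus 35kiB of mbedtls is 43008 bytes, so the request is about 5568 bytes short even before fragmentation is taken into account.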

mbedtls calls malloc a lot -- the sequence of allocations up to the first free is (16717,4429,220,128,2240,16,344,1435,32,32,172,260,4,16,16,16,16,16,16,16,344,1306,32,32,32,32,172,260,4,16,16,344,1380,32,32,32,172,516,4,16,4,4,32).

This shows clearly why it's impossible to do an SSL request on IDF 4.4.

I implemented #8526 for ESP32 and used mbedtls_platform_set_calloc_free to intercept mbedtls allocs. The implementation was to reserve a contiguous 64kiB for IDF (for lwip, wifi), then use all remaining IDF blocks for split heaps. This successfully does an https request. But there can only be one concurrent request, and BLE cannot be enabled. So the problem is, in order to keep everything else working (BLE, wifi, lwip) you can only really afford to take the first block anyway, so the benefit of split heap is marginal here.

@jimmo
Member Author

jimmo commented Jul 21, 2022

For comparison on IDF 4.2, the heap total is 246288, the largest contiguous free block is 113804, and it allocates that full amount for the MicroPython heap. After that, the remaining contiguous blocks are 76776, 15036, 4456. After connecting to wifi, the contiguous blocks are 35128, 15036. At the wrap_socket call, it's 33300, 15036 (compare to 22528, 13312, 1600 for IDF 4.4).

@dpgeorge
Member

So the problem is, in order to keep everything else working (BLE, wifi, lwip) you can only really afford to take the first block anyway, so the benefit of split heap is marginal here.

I don't understand this conclusion: to get BLE, WiFi and lwIP working needs 19k+45.5k+7k = ~ 72kiB free in the IDF heap. For that you could leave blocks 1&2, or better 2&3, for the IDF. Then uPy could use the other two blocks for its heap. Or could leave block 4 (112208 bytes) for the IDF and uPy could use blocks 1, 2 and 3 for its heap. Wouldn't that work and allow everything to run at the same time (with one ssl socket)?
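To make the options concrete, here is a small sketch that totals the startup blocks for each proposed split and compares them with the 19k + 45.5k + 7k estimate. The block sizes come from the heap dump at the top of the issue; the function names and the bitmask convention are illustrative assumptions, and the check looks only at totals, not at contiguity or per-socket costs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Startup blocks 1..4 from the heap dump, in bytes. */
static const size_t startup_blocks[4] = {4452, 69256, 13440, 112208};

/* 19 KiB (BLE) + 45.5 KiB (WiFi) + 7 KiB (lwIP) = 71.5 KiB. */
#define IDF_DEMAND_BYTES ((size_t)(19456 + 46592 + 7168))

/* Total bytes in the blocks selected by mask (bit 0 = block 1, etc.). */
size_t selected_bytes(unsigned mask) {
    size_t sum = 0;
    for (unsigned i = 0; i < 4; i++) {
        if (mask & (1u << i)) {
            sum += startup_blocks[i];
        }
    }
    return sum;
}

/* True if the selected blocks cover the estimated IDF demand in total. */
bool covers_idf_demand(unsigned mask) {
    return selected_bytes(mask) >= IDF_DEMAND_BYTES;
}
```

On these numbers, blocks 1&2 (mask 0x3) total 73708 bytes and clear the 73216-byte estimate by only a few hundred bytes, blocks 2&3 (mask 0x6) total 82696 bytes, and block 4 alone (mask 0x8) provides 112208 bytes.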

@dpgeorge
Member

The bigger picture is that WiFi+BLE+SSL sockets require a lot of resources, pushing an embedded device to its limit. And the IDF itself is rather hungry when it comes to RAM usage, because the WiFi and BLE subsystems run on the same SoC (in contrast to stm32 where the WiFi and BLE are off chip).

Nevertheless, it's still possible to do all these things at once, albeit with only a little RAM to spare for the actual application (PSRAM would make things a lot better).

I think it'd be great if we could be as flexible as possible with RAM usage so that the user can do WiFi+BLE+SSL (without PSRAM) if they need. And if they only need two out of those three items then memory is not unused/wasted/reserved for the other part that they don't need.

@jimmo
Member Author

jimmo commented Jul 21, 2022

For that you could leave blocks 1&2

That's 73kiB total for the IDF. This means you can only have a single (non-secure) socket connected, regardless of SSL.

or better 2&3

So currently we give 1&2&3, but 1 is only 4kiB, so it's pretty similar. Split heap is just enabling that 4kiB from 1 to go to uPy.

Or could leave block 4 (112208 bytes) for the IDF and uPy could use blocks 1, 2 and 3 for its heap.

I hadn't considered this because I figured reducing the size of the uPy heap wasn't a good option. I guess you could also partially allocate ~20-25kiB of block 4 to uPy as well, which would come pretty close to the current uPy heap size.
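For reference, the size of that slice can be worked out from the startup numbers. The helper below is a hypothetical illustration of the arithmetic, not existing MicroPython code.

```c
#include <stddef.h>

/* Bytes to take from block 4 so that blocks 1-3 plus that slice
 * add up to the target uPy heap size (0 if blocks 1-3 already suffice). */
size_t slice_from_block4(size_t b1, size_t b2, size_t b3, size_t target) {
    size_t small_blocks = b1 + b2 + b3;
    return target > small_blocks ? target - small_blocks : 0;
}
```

Blocks 1 to 3 total 87148 bytes, so matching the current 110592-byte uPy heap would take 23444 bytes (about 22.9kiB) from block 4, leaving 88764 bytes of that block for the IDF.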

(Although when it comes to being able to allocate the SSL buffers successfully in a potentially fragmented uPy heap, I think it's better that uPy has a single large contiguous heap? Although with the proposed SSLContext we can avoid this by forcing early alloc)

@dpgeorge
Member

I hadn't considered this because I figured reducing the size of the uPy heap wasn't a good option

I think we need to, to make things "just work" by default.

Maybe we can optionally increase the size of the uPy heap by adding regions at runtime. This could be done automatically (eg when GC heap runs out) or explicitly by the user, eg micropython.increase_heap(...). Then if the code doesn't use BLE more of the IDF heap can be allocated for uPy GC.

So we'd need:

  • multiple GC heaps
  • ability to add GC regions at runtime (auto or explicit)
  • mbedtls allocating from the uPy heap
  • SSLContext to preallocate wrap_socket buffers
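The first two items could look roughly like the toy region table below. This is only a sketch of the idea of a heap made of several regions that can grow at runtime; the type and function names are hypothetical and it is not MicroPython's GC implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_REGIONS 4

/* One contiguous memory region handed to the (toy) GC heap. */
typedef struct {
    void *start;
    size_t len;
} gc_region_t;

/* A heap made of up to MAX_REGIONS separate regions. */
typedef struct {
    gc_region_t regions[MAX_REGIONS];
    size_t count;
} gc_heap_t;

/* Add a region at runtime; fails if the table is full or the region is empty. */
bool heap_add_region(gc_heap_t *h, void *start, size_t len) {
    if (h->count >= MAX_REGIONS || start == NULL || len == 0) {
        return false;
    }
    h->regions[h->count].start = start;
    h->regions[h->count].len = len;
    h->count++;
    return true;
}

/* Total bytes currently managed across all regions. */
size_t heap_total(const gc_heap_t *h) {
    size_t sum = 0;
    for (size_t i = 0; i < h->count; i++) {
        sum += h->regions[i].len;
    }
    return sum;
}
```

An automatic policy would call something like `heap_add_region` when a collection fails to free enough memory, while an explicit policy would expose it to the user.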

@Carglglz
Contributor

Not sure if this could be of any help or maybe you already have looked into it, but just in case:

CONFIG_MBEDTLS_DYNAMIC_BUFFER

or

CONFIG_MBEDTLS_DYNAMIC_FREE_CONFIG_DATA
CONFIG_MBEDTLS_DYNAMIC_FREE_CA_CERT

Although I'm not sure how much heap it would save, or if it would be enough to solve the ENOMEM errors... 🤔

Also with SSLContext it may be possible to indicate a set of cipher-suites that are less resource intensive which probably will help too.

@AmirHmZz
Contributor

Any updates on this?

@DvdGiessen
Sponsor Contributor

If you're currently blocked by out of memory issues when using SSL, I have a (not thoroughly tested) change in my tree, quickly committed here, that implements allocating mbedtls buffers from the MicroPython heap. Feel free to pick it for your local development build. It did solve my issues on IDF4.4, unblocking me for now. I believe Jim mentioned in #8915 he was testing with a similar change.

@jimmo
Member Author

jimmo commented Aug 17, 2022

If you're currently blocked by out of memory issues when using SSL, I have a (not thoroughly tested) change in my tree, quickly committed here, that implements allocating mbedtls buffers from the MicroPython heap.

@DvdGiessen @AmirHmZz Yes this is exactly the approach I used, however note that there will be some calls from mbedtls that are not on the MicroPython task, in which case accessing the MicroPython heap will not work. The solution I had for this was just to put an extra layer between mbedtls and m_tracked_calloc to check the thread state:

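// Allocation hook for mbedtls. Calls made off the MicroPython task have no
// thread state and must not touch the MicroPython heap, so use the C heap.
void *mbedtls_calloc_wrapper(size_t nmemb, size_t size) {
    if (mp_thread_get_state() == NULL) {
        return calloc(nmemb, size);
    } else {
        return m_tracked_calloc(nmemb, size);
    }
}

// Matching free hook: the same thread-state check selects the allocator,
// so a block must be freed from the same kind of task that allocated it.
void mbedtls_free_wrapper(void *ptr) {
    if (mp_thread_get_state() == NULL) {
        free(ptr);
    } else {
        m_tracked_free(ptr);
    }
}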

then in main()

mbedtls_platform_set_calloc_free(&mbedtls_calloc_wrapper, &mbedtls_free_wrapper);

@beyonlo

beyonlo commented Sep 7, 2022

Nevertheless, it's still possible to do all these things at once, albeit with only a little RAM to spare for the actual application (PSRAM would make things a lot better).

I don't think we can count on PSRAM for this solution, because PSRAM is very slow: roughly 70 to 100 ms for a gc.collect(), and some applications can't accept that delay. I'm using an ESP32-S3 without PSRAM precisely because that GC time (even when run explicitly at short intervals) is too long for my project.

I think it'd be great if we could be as flexible as possible with RAM usage so that the user can do WiFi+BLE+SSL (without PSRAM) if they need. And if they only need two out of those three items then memory is not unused/wasted/reserved for the other part that they don't need.

Excellent! That will be great. To give my scenario as an example: I need 6 persistent SSL sockets + WiFi, and no BLE. So the RAM that BLE would use should not be used or reserved in my application. Many users and applications will benefit from this flexible mode.

@projectgus
Contributor

projectgus commented Oct 6, 2022

Is it useful to have the mbedTLS wrapper allocators try to allocate from the IDF heap first, and only fall back to Python heap if the IDF heap allocation fails?

I'm thinking that although there are going to be runtime configurations where the IDF heap is full and the Python heap may be empty, there are equally likely to be some where IDF heap is relatively empty (i.e. if Bluetooth is not in use, or perhaps if PSRAM is enabled).

Means the free path gets fiddly with checking what region a pointer lands on, of course.
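A minimal, self-contained sketch of that idea is shown below. Everything here is a stand-in: the fixed arena plays the role of the Python heap, the C library heap plays the role of the IDF heap, `idf_heap_full` is a test hook to simulate exhaustion, and the arena never reclaims memory. It only illustrates the "try one heap, fall back to the other, then route free by address" pattern, not how MicroPython would implement it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the Python heap: a fixed, 8-byte aligned arena. */
static uint64_t py_arena_words[128];
static size_t py_used = 0;

/* Test hook: pretend the primary (IDF) heap is exhausted. */
bool idf_heap_full = false;

/* True if p points into the stand-in Python arena. */
bool ptr_in_py_arena(const void *p) {
    uintptr_t addr = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)py_arena_words;
    return addr >= lo && addr < lo + sizeof(py_arena_words);
}

/* Try the primary heap first, then fall back to the arena. */
void *fallback_calloc(size_t nmemb, size_t size) {
    if (size != 0 && nmemb > SIZE_MAX / size) {
        return NULL; /* nmemb * size would overflow */
    }
    size_t bytes = nmemb * size;
    if (!idf_heap_full) {
        void *p = calloc(nmemb, size);
        if (p != NULL) {
            return p;
        }
    }
    if (bytes > sizeof(py_arena_words)) {
        return NULL;
    }
    size_t aligned = (bytes + 7) & ~(size_t)7; /* keep 8-byte alignment */
    if (aligned > sizeof(py_arena_words) - py_used) {
        return NULL;
    }
    void *p = (unsigned char *)py_arena_words + py_used;
    py_used += aligned;
    memset(p, 0, bytes);
    return p;
}

/* Route the free by checking which region the pointer lands in. */
void fallback_free(void *ptr) {
    if (ptr == NULL || ptr_in_py_arena(ptr)) {
        return; /* this toy arena does not reclaim memory */
    }
    free(ptr);
}
```

The address-range check in `fallback_free` is the fiddly part mentioned above: it only works if the fallback heap's address range is known and does not overlap the primary heap.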

@jimmo
Member Author

jimmo commented Oct 6, 2022

Is it useful to have the mbedTLS wrapper allocators try to allocate from the IDF heap first, and only fall back to Python heap if the IDF heap allocation fails?

Maybe! My thought was that the benefit of stealing back that "unused" IDF heap to give to MicroPython would be overall better.

(But yes like you say, tracking this for free will be interesting)

@projectgus
Contributor

My thought was that the benefit of stealing back that "unused" IDF heap to give to MicroPython would be overall better.

Oh, if that's on the agenda then it makes total sense!

(Is the end result going to be replacing the IDF heap component with one that wraps all C allocations from the MP heap? 😁)

@chaseadam

Looks like ESP-IDF v5 (beta release in August 2022) is switching to mbedTLS v3.x per https://github.com/espressif/mbedtls/wiki#mbed-tls-support-in-esp-idf

May affect the need (or approach) for this issue?

Here is the effort to migrate to mbedTLS v3.x #8988 which is on hold until after 1.20 release

@tve
Contributor

tve commented Nov 18, 2022

When I was using mpy 1.13 I had two different builds, one with BLE and one without. I normally don't use BLE and the savings from removing it are just too big to ignore. The BLE build used:

#define MICROPY_IDF_HEAP_MIN 61440

which was picked up in main to keep a specific amount of memory for the IDF.

IMHO it would be better to have a way to adjust the heap in boot.py as suggested in #6785 than to create complicated memory allocation schemes that only lead to very difficult to troubleshoot bugs later...

@MaxMyzer

MaxMyzer commented Jan 7, 2023

Hi, just wanted to add that the issues with running out of memory with ussl.wrap_socket also appear to happen on the RP2 port.

@abuvanth

any update?

@chaseadam

chaseadam commented Jan 31, 2023

looks like circuitpython 8.0 candidate includes update to 4.4. Curious how they dealt with the fragmentation:

Update ESP-IDF to latest release/v4.4 commits. #7486. Thanks @dhalbert.

There may be something in this pull request? adafruit/esp-idf#11

@dhalbert

(belated comment)

looks like circuitpython 8.0 candidate includes update to 4.4. Curious how they dealt with the fragmentation:

I am not sure if we (CircuitPython) are doing anything special. We do have a limited BLE implementation, but almost no one uses it. Since nimble doesn't do dynamic service creation out of the box, it doesn't match our nRF-inspired API. We are considering how to improve nimble or get around that, but haven't done anything yet. So we may just not be running into the wifi+BLE storage limits right now.

I am not that familiar with the details of our Espressif wifi code. @tannewt Do you have any comment here?

@Mopele

Mopele commented Jan 7, 2024

Are there any new developments?

@jimmo
Member Author

jimmo commented Jan 8, 2024

Are there any new developments?

@Mopele yes. v1.21 and the recent v1.22 included several changes related to this. See https://github.com/orgs/micropython/discussions/12316

tannewt pushed a commit to tannewt/circuitpython that referenced this issue Feb 20, 2024