Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel Panic in HTTP Power Meter #381

Closed
MalteSchm opened this issue Aug 13, 2023 · 13 comments
Closed

Kernel Panic in HTTP Power Meter #381

MalteSchm opened this issue Aug 13, 2023 · 13 comments
Labels
bug Something isn't working

Comments

@MalteSchm
Copy link

MalteSchm commented Aug 13, 2023

What happened?

OpenDTU reboots. Most likely in Power Meter class

To Reproduce Bug

Unclear

Expected Behavior

No reboots

Install Method

Self-Compiled

What git-hash/version of OpenDTU?

9f161a3

Relevant log/trace output

16:37:12.002 > [PowerLimiterClass::loop] ******************* ENTER **********************
16:37:12.010 > [PowerLimiterClass::calcNextInverterRestart] Localtime read 16 37 / configured RestartHour 12
16:37:12.018 > [PowerLimiterClass::calcNextInverterRestart] dayMinutes 997 / targetMinutes 720
16:37:12.026 > [PowerLimiterClass::calcNextInverterRestart] next inverter restart in 1163 minutes
16:37:12.033 > [PowerLimiterClass::calcNextInverterRestart] _nextInverterRestart @ 69794700 millis
16:37:12.044 > [PowerLimiterClass::loop] dcVoltage: 50.00 Voltage Start Threshold: 49.00 Voltage Stop Threshold: 48.90 inverter->isProducing(): 1
16:37:12.055 > [HttpPowerMeter] Get
16:37:12.318 > Websocket: [/livedata][2] connect
16:37:12.401 > [HttpPowerMeter] Ok
16:37:12.407 > [HttpPowerMeter] size 1239[HttpPowerMeter] End
16:37:12.422 > [HttpPowerMeter] End2
16:37:12.424 > Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
16:37:12.432 > 
16:37:12.432 > Core  1 register dump:
16:37:12.434 > PC      : 0x40120946  PS      : 0x00060830  A0      : 0x800d8d05  A1      : 0x3ffb0d60  
16:37:12.442 > A2      : 0x3ffb0e6c  A3      : 0x00000001  A4      : 0x3ffe3454  A5      : 0x3ffb81e8  
16:37:12.450 > A6      : 0x3ffcaf10  A7      : 0x3ffb0d1c  A8      : 0x80125abc  A9      : 0x3ffb0d40  
16:37:12.457 > A10     : 0x3ffdd6f0  A11     : 0x3f432581  A12     : 0x000004d7  A13     : 0x3ffb0bc0  
16:37:12.466 > A14     : 0x000004d7  A15     : 0x0000000c  SAR     : 0x00000011  EXCCAUSE: 0x0000001c  
16:37:12.473 > EXCVADDR: 0x0000003d  LBEG    : 0x4008c069  LEND    : 0x4008c079  LCOUNT  : 0xffffffff  
16:37:12.481 > 
16:37:12.481 > 
16:37:12.481 > Backtrace: 0x40120943:0x3ffb0d60 0x400d8d02:0x3ffb0d80 0x400d9805:0x3ffb1150 0x400e4f81:0x3ffb2150 0x400e4fd9:0x3ffb2170 0x400e3e21:0x3ffb2190 0x400e4c03:0x3ffb21c0 0x4010185c:0x3ffb2270 0x40128d85:0x3ffb2290
16:37:12.863 > 
16:37:12.863 >   #0  0x40120943:0x3ffb0d60 in HTTPClient::~HTTPClient() at C:/Users/malte/.platformio/packages/framework-arduinoespressif32/libraries/HTTPClient/src/HTTPClient.cpp:109
16:37:12.863 >   #1  0x400d8d02:0x3ffb0d80 in HttpPowerMeterClass::httpRequest(char const*, Auth, char const*, char const*, char const*, char const*, unsigned int, char*, unsigned int, char*, unsigned int) at src/HttpPowerMeter.cpp:58
16:37:12.863 >   #2  0x400d9805:0x3ffb1150 in HttpPowerMeterClass::updateValues() at src/HttpPowerMeter.cpp:37
16:37:12.863 >   #3  0x400e4f81:0x3ffb2150 in PowerMeterClass::readPowerMeter() at src/PowerMeter.cpp:188
16:37:12.863 >   #4  0x400e4fd9:0x3ffb2170 in PowerMeterClass::getPowerTotal(bool) at src/PowerMeter.cpp:105
16:37:12.863 >   #5  0x400e3e21:0x3ffb2190 in PowerLimiterClass::calcPowerLimit(std::shared_ptr<InverterAbstract>, bool, bool) at src/PowerLimiter.cpp:421
16:37:12.863 >   #6  0x400e4c03:0x3ffb21c0 in PowerLimiterClass::loop() at src/PowerLimiter.cpp:293
16:37:12.863 >   #7  0x4010185c:0x3ffb2270 in loop() at src/main.cpp:209
16:37:12.863 >   #8  0x40128d85:0x3ffb2290 in loopTask(void*) at C:/Users/malte/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:50
16:37:12.863 > 
16:37:12.864 > 
16:37:12.864 > 
16:37:12.864 > 
16:37:12.864 > ELF file SHA256: e5563cec9168a6e8
16:37:12.864 > 
16:37:12.864 > E (15388) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
16:37:12.864 > Rebooting...
16:37:12.864 > ets Jul 29 2019 12:21:46
16:37:12.864 > 
16:37:12.864 > rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
16:37:12.864 > configsip: 0, SPIWP:0xee
16:37:12.864 > clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
16:37:12.864 > mode:DIO, clock div:2
16:37:12.864 > load:0x3fff0030,len:1184
16:37:12.864 > load:0x40078000,len:13192
16:37:12.864 > load:0x40080400,len:3028
16:37:12.864 > entry 0x400805e4
16:37:13.214 > E (700) esp_core_dump_flash: No core d f   partition found!
16:37:13.219 > E (700) esp_core_dump_flash: No core dump partition found!
16:37:13.242 > 
16:37:13.243 > Starting OpenDTU

Anything else?

I'm seeking help / ideas here. While testing a newly implemented feature (see * below) I noticed reboots related to reading the HTTP Power Meter. Since these should not be connected to the changes I did I have disabled my Code and the issue still exists.

I have added some print statements to verify that I'm getting HTTP OK and a response (1239 byte). I copied a code snippet with these changes for reference. The printouts are present in the shared log.

(*)
The feature I'm adding is an improved Huawei CAN bus communication to address #316. This implements CAN bus communication as a separate thread so that all values send over the bus are actually captured. Above panic was captured when the Huawei PSU was disabled using the GUI. This means that the Huawei device is not initialized, the thread is not started and that the code related to the Huawei PSU is not executed

 <snip>
   if (httpCode == HTTP_CODE_OK) {
        MessageOutput.println("[HttpPowerMeter] Ok");

        String responseBody = httpClient.getString();

        if (responseBody.length() > (responseSize - 1)) {
            MessageOutput.println("[HttpPowerMeter] large");
            snprintf_P(error, errorSize, "Response too large! Response length: %d Body start: %s",
                httpClient.getSize(), responseBody.c_str());
        } else {
            MessageOutput.printf("[HttpPowerMeter] size %i", responseBody.length());
            snprintf(response, responseSize, responseBody.c_str());
        }
    } else if (httpCode <= 0) {
        snprintf_P(error, errorSize, "Error(%s): %s", newUrl.c_str(), httpClient.errorToString(httpCode).c_str());
    } else if (httpCode != HTTP_CODE_OK) {
        snprintf_P(error, errorSize, "Bad HTTP code: %d", httpCode);
    }

    MessageOutput.println("[HttpPowerMeter] End");

    httpClient.end();
    delete wifiClient;
    MessageOutput.println("[HttpPowerMeter] End2");

    if (error[0] != '\0') {
        return false;
    }

    return true;
}
@MalteSchm MalteSchm added the bug Something isn't working label Aug 13, 2023
@schlimmchen
Copy link
Collaborator

Oh dear lord... I spent around three hours with this already. While I found it very interesting, it's also quite frustrating as my thoughts are inconclusive.

Use-After-Free

I am sure to have found a "use-after-free":

  • HttpPowerMeterClass::httpRequest allocates a pointer on the stack that points to a WiFiClient. It also allocates an HTTPClient on the stack.
  • The WiFiClient is constructed using new.
  • The HTTPClient::begin() function is called. It sets its _client to point to the WiFiClient instance constructed earlier. _client is no longer nullptr.
  • The WiFiClient is later destructed using delete.
  • The scope (the whole function HttpPowerMeterClass::httpRequest) is left, hence HTTPClient::~HTTPClient is executed. That instance has a _client that is non-null due to the use of begin(), even if begin() failed***. _client is now non-null but points to deleted memory. The HTTPClient destructor tests _client and calls _client->stop().

*** This is a huge issue in my eyes. They even assume that _client is nullptr in case ::begin() failed, they say so in ::connect() themselves: if (!_client) { log_d("HTTPClient::begin was not called or returned error"); [...], yet they do not take care to reset _client if ::begin() does indeed fail.

However, even though I am certain to have walked the respective program path, this does not trigger the issue you mentioned. And it makes sense. The "use-after-free" and your issue are unrelated, as I have to realize.

I am leaving this Python HTTP server for reference here:

import socket
import json

def start_server(port, response, content):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(('0.0.0.0', port))
        s.listen()
        while True:
            conn, addr = s.accept()
            with conn:
                print('Connected by', addr)
                request = ""
                while not request.endswith("\r\n\r\n"):
                    chunk = conn.recv(16).decode()  # Adjust the chunk size as needed
                    if not chunk: break
                    request += chunk

                full_response = response.format(content_length=len(content)) + content
                print("Request received, sending response...")
                conn.sendall(full_response.encode())
                conn.shutdown(socket.SHUT_WR) # Ensures data is on the wire before closing
                conn.close()

port = 65432
response_headers = "HTTP/1.1 200 OK\r\nContent-Length: {content_length}\r\nContent-Type: application/json\r\n\r\n"
content = json.dumps({"total_power": 123})
start_server(port, response_headers, content)

The goal was to immediately close the TCP connection after sending the response, such that I was sure that HTTPClient::disconnect would leave the _client of its instance untouched, causing the use-after-free in its d'tor later.

I am certain of this because I enabled the debug messages of HTTPClient by using build_flags = -DCORE_DEBUG_LEVEL=5 in platformio.ini. I see the message tcp is closed before the HTTPClient destructor is called, which tells me that the _client member is non-nullptr.

Mitigation options

  1. Allocate WiFiClient on the stack rather than the heap and allocate it before HTTPClient, so it is cleaned up in reverse order. That will break as soon as someone rearranges the allocation order because who should know that the order is of such importance. Also, allocating on the stack might not be a good option due to the size of WiFiClient. I don't know... Probably the WiFiClient is allocated on the heap for a reason.
  2. Allocate WiFiClient using a std::unique_ptr and do that before allocating the HTTPClient. This removes one of the shortcomings of the previous idea: The WiFiClient is allocated on the heap. We still have to make sure that nobody re-arranges the order of allocation (first the std::uniqe_ptr, then the HTTPClient).
  3. Call HTTPClient::end() if HTTPClient::begin() failed. Not a reliable option, as there are program paths that do not reset _client to nullptr.
  4. Accepting that we have to manage WiFiClient's lifetime -- even though it should be HTTPClient responsibility to manage it's _client's lifetime, e.g. by requesting a std::shared_ptr<WiFiClient>. That means: Do 2 and put a big fat warning around the allocation of the std::unique_ptr. See the respective comment in the HTTPClient header:
/*
 * Since both begin() functions take a reference to client as a parameter, you need to 
 * ensure the client object lives the entire time of the HTTPClient
 */

Your issue

Here is what I can contribute:

  • It is odd to me that the backtrace points to src/HttpPowerMeter.cpp:58, where the HTTPClient is allocated/constructed on the stack and then the backtrace points to the destructor of HTTPClient (rather than it's c'tor). I guess this is normal and indicates that the cleanup is happening right before the variable goes out of scope.
  • HTTPClient/src/HTTPClient.cpp:109 is an if-statement: if (_client). That should not be causing problems unless the memory from which the pointer itself is read is out-of-bounds. This seems to be the case exactly: EXCVADDR: 0x0000003d indicates that the read happens from 0x3d. This is probably not the offset of the _client member within HTTPClient because 61 Bytes is not be aligned to 32 Bits. So what is this number/offset?
  • Does this mean that this is nullptr when the HTTPClient destructor is running? How is that possible?

I was unable to reproduce this bug. Can you reproduce this bug consistently or at least with a high probability and within a reasonable timespan, so we can investigate a possible fix?

My best idea is that there is some kind of stack corruption going on.

A stack overflow is a candidate. However, there seem to be checks in place that detect stack overflows, so we should see a respective error if that was the case. From the backtrace we see that around 5.4k of stack is in use while the stack size should be 8k, so there should still be room. The stack use really should be optimized, see below, but I don't see that a stack overflow is the reason for the original error.

Also, these messages I put in the code strongly suggest that no stack overflow is occuring:

readPowerMeter: uxTaskGetStackHighWaterMark( NULL ) = 5080
updateValues: uxTaskGetStackHighWaterMark( NULL ) = 3420
httpRequest: uxTaskGetStackHighWaterMark( NULL ) = 2444

I don't see that the program corrupts its own stack by out-of-bound writes.

If I am correct about the stack corruption, the corruption might be caused by another task. That would mean that your error does only occur by chance and that the corruption would need to happen while HttpPowerMeterClass::httpRequest is running. Do you see other types of backtraces and core dumps or are they the same?

Although I am intrigued by this, I have to give up investigating this, so I make progress with my own features.

Other issues

  • In HttpPowerMeterClass::httpRequest
    • response and error are used without checking for nullptr.
    • response and error are written without checking whether they can accomodate at least one byte.
    • httpHeader and httpHeader are used without testing for nullptr.
    • String response shadows char* response.
  • In HttpPowerMeterClass::updateValues()
    • Allocating char response[2000] on the stack is insane. Even char error[256] is not ideal. That should be dynamic memory allocated on the heap using a std::unique_ptr.

@MalteSchm
Copy link
Author

Thanks @schlimmchen that is a thorough writeup. 👍 I will review and investigate. I'm still seeing these re-starts but less often then when I filed this bug. I suspect that is has to do with my repeater and getting data into the garage.

@MalteSchm
Copy link
Author

I finally had some time.

Looking into this I thought I might just disable creating the WifiClient in httpRequest for now. This breaks making https requests which I don't use. So for testing it is not a problem. If my issue is related to accessing deleted WifiClient from HttpClient this should increase stability. I suppose this makes HttpClient handle this internally.

The other issue I (believe I) found was that HttpPowerMeterClass::updateValues() parses the json response (getFloatValueByJsonPath) also in cases where the HTTP Request failed (httpRequest).

@schlimmchen
Copy link
Collaborator

I suspect that is has to do with my repeater and getting data into the garage.

Hm, so you think that intermittent WiFi Connection Problems might trigger the issue?

The other issue I (believe I) found was that HttpPowerMeterClass::updateValues() parses the json response
(getFloatValueByJsonPath) also in cases where the HTTP Request failed (httpRequest).

Yes, that's not ideal. httpRequest does set the first Byte to \0 in the response, so it will appear empty even after a failed request. It's certainly bad style. HttpPowerMeterClass::updateValues should just return false as soon as something goes sideways.

@MalteSchm
Copy link
Author

Hm, so you think that intermittent WiFi Connection Problems might trigger the issue?

Yes that is my guess.

I tested commenting out all WifiClient related code in httpRequest this week. This makes HttpClient manage the WifiClient directly but it has the downside of breaking HTTPS support. This test resulted in a significant stability improvement. With the original code I had an reboot approximately every 8 hours. With this change I only saw one reboot in the last 5 days. This was caused by a different issue (some TCP related assertion). Serial logging was enabled all the time.

Just now I looked at how to realize your idea using std::unique_ptr and to bring https back. I came up with this change here: https://github.com/MalteSchm/OpenDTU-OnBattery/tree/http_power_meter_fixes
I thought it would be simplest to drop managing the WifiClient for HTTP calls, so I only create a secureWifiClient when required for HTTPS

@schlimmchen If you have some feedback regarding this change please let me know. My knowledge of these C++ pointers is limited / nonexistent.

Right now I'm testing this but my shelly does not support HTTPS. Is there anybody who could give this a try with HTTPS (@helgeerbe : do you have an idea who could help here?)

@schlimmchen
Copy link
Collaborator

Interesting progress!

So you let HttpClient manage the WifiClient itself, which improves the situation. However, you suggest to still manage the WifiClientSecure by the power meter implementation to bring back HTTPS support. I understand that, but doesn't that just move the problem to the case where HTTPS is in use?

My feedback on your code:

  • You use secureWifiClient = std::unique_ptr<WiFiClientSecure>(new WiFiClientSecure ()); to create the HTTPS wifi client. That line works, but it's really not how it is supposed to be done. Have a look at this: secureWifiClient = std::make_unique<WiFiClientSecure>();. In particular you avoid the use of new, which should have no place in modern higher-level C++ code.
  • Maybe throw #include <memory> in there to be explicit.

Would you be willing to do another week of testing with a firmware where both WifiClient and WifiClientSecure are managed by a std::unique_ptr? If the problem is gone then as well, we learned that indeed the lifetime of the WiFiClient was the issue. Otherwise we risk fixing the issue only for HTTP, not HTTPS.

@MalteSchm
Copy link
Author

MalteSchm commented Sep 2, 2023

Thanks a lot for the feedback. I implemented the code below now and will let that run for some time.
What puzzles me is that the original code used a single pointer to wifiClient and then called a secureWifiClient method after casting. secureWifiClient is derived from wifiClient but why is this method there in this case?
I wonder if this should also be done when using unique_ptr

I tested using secureWifiClient also for http calls but this did not work.

    // secureWifiClient MUST be created before HTTPClient
    // see discussion: https://github.com/helgeerbe/OpenDTU-OnBattery/issues/381
    std::unique_ptr<WiFiClient> wifiClient;
    std::unique_ptr<WiFiClientSecure> secureWifiClient;

    HTTPClient httpClient;
    if (urlProtocol == "https") {
      secureWifiClient = std::make_unique<WiFiClientSecure>();
      secureWifiClient->setInsecure();
      if (!httpClient.begin(*secureWifiClient, newUrl)) {
        snprintf_P(error, errorSize, "httpClient.begin(%s) failed", newUrl.c_str());
        return false;
      }
    } else {
      wifiClient = std::make_unique<WiFiClient>();
      if (!httpClient.begin(*wifiClient, newUrl)) {
        snprintf_P(error, errorSize, "httpClient.begin(%s) failed", newUrl.c_str());
        return false;
      }
    }

@schlimmchen
Copy link
Collaborator

Thanks a lot for the feedback.

You are very welcome, I enjoy this discussion!

Your code works fine, I still wanted to show you a way to avoid some of the code duplication and the second std::unique_ptr. I think this answers a part of your questions:

    std::unique_ptr<WiFiClient> wifiClient;
    if (urlProtocol == "https") {
        auto secureWifiClient = std::make_unique<WiFiClientSecure>();
        secureWifiClient->setInsecure();
        wifiClient = std::move(secureWifiClient);
    } else {
        wifiClient = std::make_unique<WiFiClient>();
    }

    HTTPClient httpClient;
    if (!httpClient.begin(*wifiClient, newUrl)) {
        snprintf_P(error, errorSize, "httpClient.begin(%s) failed", newUrl.c_str());
        return false;
    }

What puzzles me is that the original code used a single pointer to wifiClient and then called a secureWifiClient method after casting.

I am unsure about what is puzzling you. So I will just explain what these lines do and hope that it answers your question:

    WiFiClient* wifiClient = NULL;
    wifiClient = new WiFiClientSecure;
    reinterpret_cast<WiFiClientSecure*>(wifiClient)->setInsecure();

wifiClient is a pointer to the base class WiFiClient. It is initialized to point to a new WiFiClientSecure. However, only methods of the base class are accessible through the variable wifiClient. setInsecure() is only part of WiFiClientSecure, not WiFiClient (I did not check, but I am highly confident that I am right about this). For that reason, the pointer is cast to point to the derived type, so setInsecure() can be called. This cast would fail or even cause undefined behavior if the pointer did not actually point to an instance of the respective derived type, but as this is indeed the case, there are no worries.

I avoided this problem by creating a std::unique_ptr<WiFiClientSecure> using std::make_unique<...>(), which already points to the derived type. setInsecure() is called, and then the pointer is moved to the std::unique_ptr that points to the base class. That's implicitly possible and always safe. And it works because we later only need a pointer to the instance of the base class.

secureWifiClient is derived from wifiClient but why is this method there in this case?

Did I answer that? setInsecure() is only part of WiFiClientSecure, not WiFiClient.

@MalteSchm
Copy link
Author

Did I answer that? setInsecure() is only part of WiFiClientSecure, not WiFiClient.

Thanks you did. I misread the code thinking that we create an instance of WifiClient here.
I made the suggested implementation and will leave this running for some time now.

@MalteSchm
Copy link
Author

Hi @schlimmchen

This ran for 5 days now with no unexpected reboots. I created a PR here: #430

@schlimmchen
Copy link
Collaborator

Nice 💪

@MalteSchm
Copy link
Author

Uptime 10days and PR merged. Closing

Copy link

github-actions bot commented Apr 4, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants