Kernel Panic in HTTP Power Meter #381
Comments
Oh dear lord... I spent around three hours with this already. While I found it very interesting, it's also quite frustrating as my thoughts are inconclusive.
Use-After-Free
I am sure to have found a "use-after-free":
*** This is a huge issue in my eyes. They even assume that However, even though I am certain to have walked the respective program path, this does not trigger the issue you mentioned. And it makes sense. The "use-after-free" and your issue are unrelated, as I had to realize. I am leaving this Python HTTP server for reference here:
import socket
import json

def start_server(port, response, content):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(('0.0.0.0', port))
        s.listen()
        while True:
            conn, addr = s.accept()
            with conn:
                print('Connected by', addr)
                # Read until the blank line that terminates the request headers
                request = ""
                while not request.endswith("\r\n\r\n"):
                    chunk = conn.recv(16).decode()  # Adjust the chunk size as needed
                    if not chunk: break
                    request += chunk
                full_response = response.format(content_length=len(content)) + content
                print("Request received, sending response...")
                conn.sendall(full_response.encode())
                conn.shutdown(socket.SHUT_WR)  # Ensures data is on the wire before closing
                conn.close()

port = 65432
response_headers = "HTTP/1.1 200 OK\r\nContent-Length: {content_length}\r\nContent-Type: application/json\r\n\r\n"
content = json.dumps({"total_power": 123})
start_server(port, response_headers, content)
The goal was to immediately close the TCP connection after sending the response, such that I was sure that I am certain of this because I enabled the debug messages of
Mitigation options
Your issue
Here is what I can contribute:
I was unable to reproduce this bug. Can you reproduce this bug consistently or at least with a high probability and within a reasonable timespan, so we can investigate a possible fix? My best idea is that there is some kind of stack corruption going on. A stack overflow is a candidate. However, there seem to be checks in place that detect stack overflows, so we should see a respective error if that was the case. From the backtrace we see that around 5.4k of stack is in use while the stack size should be 8k, so there should still be room. The stack use really should be optimized, see below, but I don't see that a stack overflow is the reason for the original error. Also, these messages I put in the code strongly suggest that no stack overflow is occurring:
I don't see that the program corrupts its own stack by out-of-bounds writes. If I am correct about the stack corruption, the corruption might be caused by another task. That would mean that your error occurs only by chance and that the corruption would need to happen while Although I am intrigued by this, I have to give up investigating it so that I can make progress with my own features.
Other issues
|
Thanks @schlimmchen, that is a thorough writeup. 👍 I will review and investigate. I'm still seeing these restarts, but less often than when I filed this bug. I suspect that it has to do with my repeater and getting data into the garage. |
I finally had some time. Looking into this I thought I might just disable creating the WifiClient in httpRequest for now. This breaks making https requests which I don't use. So for testing it is not a problem. If my issue is related to accessing deleted WifiClient from HttpClient this should increase stability. I suppose this makes HttpClient handle this internally. The other issue I (believe I) found was that HttpPowerMeterClass::updateValues() parses the json response (getFloatValueByJsonPath) also in cases where the HTTP Request failed (httpRequest). |
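The second point above (parsing the response even though the request failed) can be sketched as follows. This is a minimal, self-contained illustration and not the OpenDTU code: `HttpResult`, `parseTotalPower`, and `updateValue` are hypothetical stand-ins for the result of `httpRequest`, for `getFloatValueByJsonPath`, and for `updateValues`, and the parser is a deliberately naive placeholder. The only idea shown is that the body is parsed when, and only when, the request succeeded, and that the stored value is left untouched otherwise.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the outcome of an HTTP request.
struct HttpResult {
    bool ok;           // true only if the request completed successfully
    std::string body;  // response body; not meaningful when ok is false
};

// Deliberately naive placeholder for a JSON path lookup: reads the number
// that follows "total_power": and reports failure if it cannot.
bool parseTotalPower(const std::string& body, float& out) {
    const std::string key = "\"total_power\":";
    const std::string::size_type pos = body.find(key);
    if (pos == std::string::npos) { return false; }
    try {
        out = std::stof(body.substr(pos + key.size()));
    } catch (const std::logic_error&) {  // invalid_argument or out_of_range
        return false;
    }
    return true;
}

// Updates `value` only if the request succeeded and the body could be parsed.
// Returns false and leaves `value` untouched in every other case.
bool updateValue(const HttpResult& result, float& value) {
    if (!result.ok) { return false; }  // never parse after a failed request
    float parsed = 0.0f;
    if (!parseTotalPower(result.body, parsed)) { return false; }
    value = parsed;
    return true;
}
```

With this shape, a failed request keeps the previous reading instead of feeding an empty or stale buffer to the parser.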
Hm, so you think that intermittent WiFi Connection Problems might trigger the issue?
Yes, that's not ideal. |
Yes, that is my guess. I tested commenting out all WifiClient related code in httpRequest this week. This makes HttpClient manage the WifiClient directly, but it has the downside of breaking HTTPS support. This test resulted in a significant stability improvement. With the original code I had a reboot approximately every 8 hours. With this change I only saw one reboot in the last 5 days, and that one was caused by a different issue (some TCP related assertion). Serial logging was enabled all the time. Just now I looked at how to realize your idea using @schlimmchen If you have some feedback regarding this change, please let me know. My knowledge of these C++ pointers is limited / nonexistent. Right now I'm testing this, but my Shelly does not support HTTPS. Is there anybody who could give this a try with HTTPS (@helgeerbe : do you have an idea who could help here?) |
Interesting progress! So you let HttpClient manage the WifiClient itself, which improves the situation. However, you suggest to still manage the WifiClientSecure by the power meter implementation to bring back HTTPS support. I understand that, but doesn't that just move the problem to the case where HTTPS is in use? My feedback on your code:
Would you be willing to do another week of testing with a firmware where both WifiClient and WifiClientSecure are managed by a |
Thanks a lot for the feedback. I implemented the code below now and will let that run for some time. I tested using secureWifiClient also for http calls but this did not work.
|
You are very welcome, I enjoy this discussion! Your code works fine, I still wanted to show you a way to avoid some of the code duplication and the second
std::unique_ptr<WiFiClient> wifiClient;
if (urlProtocol == "https") {
    auto secureWifiClient = std::make_unique<WiFiClientSecure>();
    secureWifiClient->setInsecure();
    wifiClient = std::move(secureWifiClient);
} else {
    wifiClient = std::make_unique<WiFiClient>();
}
HTTPClient httpClient;
if (!httpClient.begin(*wifiClient, newUrl)) {
    snprintf_P(error, errorSize, "httpClient.begin(%s) failed", newUrl.c_str());
    return false;
}
I am unsure about what is puzzling you. So I will just explain what these lines do and hope that it answers your question:
WiFiClient* wifiClient = NULL;
wifiClient = new WiFiClientSecure;
reinterpret_cast<WiFiClientSecure*>(wifiClient)->setInsecure();
I avoided this problem by creating a
Did I answer that? |
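The ownership pattern discussed above can be tried outside the firmware with a small, self-contained sketch. It makes several assumptions: `Client`, `SecureClient`, and `Http` are hypothetical stand-ins, not the Arduino `WiFiClient`, `WiFiClientSecure`, and `HTTPClient` classes, and the stand-in `Http` is simply assumed to touch its transport during teardown. The sketch shows two things: the derived object is configured while its static type is still the derived one, so no cast is needed, and the owning pointer is declared before the HTTP object, so it is destroyed after it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Records construction and destruction order so it can be inspected.
std::vector<std::string> events;

// Hypothetical stand-in for a plain TCP client.
struct Client {
    Client() { events.push_back("Client()"); }
    virtual ~Client() { events.push_back("~Client"); }  // virtual: safe to delete via base pointer
    virtual std::string kind() const { return "plain"; }
};

// Hypothetical stand-in for a TLS client; setInsecure() exists only here.
struct SecureClient : Client {
    bool insecure = false;
    void setInsecure() { insecure = true; }
    std::string kind() const override {
        return insecure ? "secure (no cert check)" : "secure";
    }
};

// Hypothetical stand-in for an HTTP client that does not own its transport.
struct Http {
    Client* client = nullptr;  // non-owning
    void begin(Client& c) { client = &c; }
    ~Http() {
        // Assumed teardown access: this is only valid while the client is alive.
        if (client) { events.push_back("~Http uses " + client->kind()); }
    }
};

// Builds the right client type and hands it back as an owning base pointer.
std::unique_ptr<Client> makeClient(const std::string& protocol) {
    std::unique_ptr<Client> client;
    if (protocol == "https") {
        auto secure = std::make_unique<SecureClient>();
        secure->setInsecure();        // configure before the type is "forgotten"
        client = std::move(secure);   // unique_ptr<Derived> converts to unique_ptr<Base>
    } else {
        client = std::make_unique<Client>();
    }
    return client;
}

// Runs one request-like scope and returns the recorded event order.
std::vector<std::string> run(const std::string& protocol) {
    events.clear();
    {
        auto client = makeClient(protocol);  // declared first, destroyed last
        Http http;                           // declared second, destroyed first
        http.begin(*client);
    }
    return events;
}
```

Calling `run("https")` records the `Http` teardown before `~Client`, which is the order that keeps the transport alive for as long as the HTTP object may still use it. Swapping the two declarations in `run` would reverse that order and reintroduce the hazard.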
Thanks, you did. I misread the code, thinking that we create an instance of WifiClient here. |
Hi @schlimmchen, this has now run for 5 days with no unexpected reboots. I created a PR here: #430 |
Nice 💪 |
Uptime 10 days and PR merged. Closing. |
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns. |
What happened?
OpenDTU reboots. Most likely in Power Meter class
To Reproduce Bug
Unclear
Expected Behavior
No reboots
Install Method
Self-Compiled
What git-hash/version of OpenDTU?
9f161a3
Relevant log/trace output
Anything else?
I'm seeking help / ideas here. While testing a newly implemented feature (see * below) I noticed reboots related to reading the HTTP Power Meter. Since these should not be connected to the changes I made, I disabled my code, and the issue still exists.
I have added some print statements to verify that I'm getting HTTP OK and a response (1239 bytes). I copied a code snippet with these changes for reference. The printouts are present in the shared log.
(*)
The feature I'm adding is an improved Huawei CAN bus communication to address #316. This implements CAN bus communication as a separate thread so that all values sent over the bus are actually captured. The above panic was captured when the Huawei PSU was disabled using the GUI. This means that the Huawei device is not initialized, the thread is not started, and the code related to the Huawei PSU is not executed.