Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MQTT connection makes the ESP8266 unresponsive #847

Closed
ymoona opened this issue Feb 11, 2018 · 45 comments
Closed

MQTT connection makes the ESP8266 unresponsive #847

ymoona opened this issue Feb 11, 2018 · 45 comments
Labels
Type: Bug Considered a bug
Milestone

Comments

@ymoona
Copy link

ymoona commented Feb 11, 2018

NOTE: This is not a support forum! For questions and support go here: https://www.letscontrolit.com/forum/viewforum.php?f=1

Steps to reproduce

How can we trigger this problem?
Configure a controller to OpenHAB MQTT, I have no password/user
Enable the controller, once connected (check the log) esp8266 becomes unresponsive.

Part of the console log of a ping:
64 bytes from 192.168.1.13: icmp_seq=1090 ttl=128 time=5.293 ms
64 bytes from 192.168.1.13: icmp_seq=1091 ttl=128 time=5.469 ms
64 bytes from 192.168.1.13: icmp_seq=1092 ttl=128 time=5.381 ms
64 bytes from 192.168.1.13: icmp_seq=1093 ttl=128 time=5.193 ms
64 bytes from 192.168.1.13: icmp_seq=1094 ttl=128 time=35.700 ms
64 bytes from 192.168.1.13: icmp_seq=1095 ttl=128 time=5.033 ms
Request timeout for icmp_seq 1096
Request timeout for icmp_seq 1097
Request timeout for icmp_seq 1098
Request timeout for icmp_seq 1099
Request timeout for icmp_seq 1100
Request timeout for icmp_seq 1101
Request timeout for icmp_seq 1102
Request timeout for icmp_seq 1103
Request timeout for icmp_seq 1104
Request timeout for icmp_seq 1105
Request timeout for icmp_seq 1106
Request timeout for icmp_seq 1107
64 bytes from 192.168.1.13: icmp_seq=1108 ttl=128 time=150.382 ms
Request timeout for icmp_seq 1109
Request timeout for icmp_seq 1110
Request timeout for icmp_seq 1111
Request timeout for icmp_seq 1112
Request timeout for icmp_seq 1113
Request timeout for icmp_seq 1114
Request timeout for icmp_seq 1115
Request timeout for icmp_seq 1116
64 bytes from 192.168.1.13: icmp_seq=1117 ttl=128 time=201.001 ms

First it was fine, the first timeout is just after I start the MQTT broker on server side. I can clearly reproduce it by stopping or starting the MQTT broker.

Does the problem presist after powering off and on? (just resetting isnt enough sometimes)
Power it on with MQTT broker running and configured makes the ESP8266 very unresponsive.
Power it on with MQTT broker NOT running gives me a perfect responsive ESP8266

Expected behavior

I would like to use MQTT and have a responsive system

Actual behavior

Tell us what happens instead?
When MQTT client is connected the system is highly unresponsive

System configuration

Hardware:
Sonoff basic

Software or git version:
ESP_Easy_v2.0-20180209_normal_ESP8266_1024

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

What rules do you have active?
And what plugins use the MQTT controller?
And at what position do you have the enabled MQTT controller? (I recently changed something that should allow the MQTT controller to be used at position 2 or 3, but maybe introduced some issues????)

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

For now I did not have any rules or plugins configured.
The controller was configured at the first entry

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

Hmm that sounds like reproducible :)
I do not have OpenHAB MQTT, but Mosquito on Domoticz.
Let's see if that is also causing WiFi issues. (and perhaps CPU load)

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

I also have espeasy running a node MCU v3, that works like charm. But that runs a different version, a few weeks earlier, but I have some troubles downgrading the firmware on my sonoff. Now it wont come back at all :-(

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

You could erase all flash content using ESPtool.py
Or disable the controller using serial?

I had something similar a few weeks ago, that's why I created this issue: #791
I guess I should do that one first :)

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

Ok, I did some more research.
On net power it becomes unresponsive with the controller for MQTT enabled.
But with USB power it behaves much better.

USB powered, controller enabled.
28 packets transmitted, 25 packets received, 10.7% packet loss
round-trip min/avg/max/stddev = 8.958/82.199/557.475/121.977 ms

USB powered, controller disabled.
36 packets transmitted, 35 packets received, 2.8% packet loss
round-trip min/avg/max/stddev = 2.405/33.548/174.722/40.889 ms

So on USB is it still responsive...

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

oh btw,
I did a erase_flash using ESPtool.py, but still the firmware ESP_Easy_v2.0-20180130_normal_ESP8266_1024.bin wont give me an AP to connect to, also serial is silent.

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

Hmm als tried earlier versions?
Sounds a bit like a device hardware issue?

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

I agree that the issue sounds a bit vage.
I just now finally got the 20180130_normal_ESP8266_1024 firmware flashed. Then I did not have an SSID to connect to. But using serial I could configure it, and I got it connected to my wifi. But it did not respond to ping, and also the webserver was not responding. Now back to ESP_Easy_v2.0-20180209_normal_ESP8266_1024.

That works fine until it connector to my MQTT broker.

I have also tried enabling NTP, but that did not make a change.

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

OK, so there is still something weird with the (OpenHAB???) MQTT controller.
Do you set DNS or IP for the controller?

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

It is set by IP, port default and no username or password. Rest of settings also default.
To exclude hardware issues, I'll retest this week with different sonoff basic

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

Perhaps also on a NodeMCU?

And I will have a look at the CPU load wrt to the MQTT controllers later this evening.

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

Yes will test that as well.

@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

Hmm there is some really fishy code in the OpenHAB controller plugin:

ESPEasy/src/_C005.ino

Lines 72 to 79 in aec539b

// in case of event, store to buffer and return...
String command = parseString(cmd, 1);
if (command == F("event"))
eventBuffer = cmd.substring(6);
else if
(PluginCall(PLUGIN_WRITE, &TempEvent, cmd));
else
remoteConfig(&TempEvent, cmd);

I think I know what is meant here and I hope it is what the compiler will do.
I will try to get some information on what this receiving data should look like.
Can you perhaps add a print statement to the log, to get more information on what it receives?
What are the values of event->String1 and event->String2 ?

Also there is no check in the code if the plugin is enabled.

@ymoona
Copy link
Author

ymoona commented Feb 11, 2018

Ik can check that, but that will take me few days to get back to you.

@tedenda
Copy link
Contributor

tedenda commented Feb 11, 2018

MQTT : Connected to broker with client ID: ESPClient_60:01:94:0E:65:90
Subscribed to: /Varmepump2/#
SYS : 20608.00
MQTT : Connected to broker with client ID: ESPClient_60:01:94:0E:65:90
Subscribed to: /Varmepump2/#
OpenHAB String1 = /Varmepump2/SysInfo/Uptime
OpenHAB String2 = 1.00

SYS : 2.00
MQTT : Connected to broker with client ID: ESPClient_60:01:94:0E:65:90
Subscribed to: /Varmepump2/#
SYS : 100.00
MQTT : Connected to broker with client ID: ESPClient_60:01:94:0E:65:90
Subscribed to: /Varmepump2/#
SYS : 20632.00
MQTT : Connected to broker with client ID: ESPClient_60:01:94:0E:65:90

Subscribed to: /Varmepump2/#
OpenHAB String1 = /Varmepump2/SysInfo/CPU
OpenHAB String2 = 100.00

SYS : 2.00
MQTT : Connected to broker with client ID: ESPClient_60:01:94:0E:65:90
Subscribed to: /Varmepump2/#

TD-er added a commit to TD-er/ESPEasy that referenced this issue Feb 11, 2018
Re-implementation of PR letscontrolit#482 which was done on Mega branch.
Just to keep merges simple between branches. (and improve readabilit)

Also added a simple check to C005 (OpenHAB MQTT) on receiving messages. (see letscontrolit#847 )
@TD-er
Copy link
Member

TD-er commented Feb 11, 2018

I added a very basic test to the OpenHab controller to at least ignore incoming messages when not enabled.
So for starters the CPU load should be lower when OpenHab is not enabled.
But still there has to be something else. This isn't causing the high CPU load.

@TD-er
Copy link
Member

TD-er commented Feb 12, 2018

This appears to be a bigger issue than first thought and may affect all MQTT based controllers.

It looks like the MQTT controller is trying too hard to (re)connect and thus freezes the ESP right after booting.
This happens with Domoticz as well as OpenHAB MQTT.

This reconnect behavior has changed quite recently in "[issue #728] Proper reconnect for MQTT controllers (#737)"
Maybe adding some kind of delay when connecting wasn't successful?
Such a delay will also call to run background processes.

@uzi18
Copy link
Contributor

uzi18 commented Feb 12, 2018 via email

@TD-er
Copy link
Member

TD-er commented Feb 12, 2018

Maybe a little too early, but at least for my ESP my fix seems to work. Daniel is trying on his version with OpenHAB MQTT issues.....

@psy0rz
Copy link
Member

psy0rz commented Feb 13, 2018

ill try to add the rest of the controllers to the regression test ASAP.

psy0rz pushed a commit that referenced this issue Feb 13, 2018
)

Re-implementation of PR #482 which was done on Mega branch.
Just to keep merges simple between branches. (and improve readabilit)

Also added a simple check to C005 (OpenHAB MQTT) on receiving messages. (see #847 )
@mattlward
Copy link

I found a note elsewhere that with a Sonoff one needs to disable serial logging for the unit to maintain stability with a Sonoff Basic. I have not Basic's running MQTT, but I have POW's and they get sluggish on the wire with serial debugging on, but never stop.

@TD-er
Copy link
Member

TD-er commented Feb 13, 2018

Hmm, that sounds like something worth investigating.
Maybe we can peek at how full the serial buffer is, when sending logs to serial?

@ymoona
Copy link
Author

ymoona commented Feb 13, 2018

I did a quick test by disable the serial connection, and enable the mqtt controller. It works! When I enable the serial connection again, it immediately becomes unresponsive.

@TD-er
Copy link
Member

TD-er commented Feb 13, 2018

Hmm, if someone else could reproduce, would be nice.
Then I'll make a new issue to automatically stop sending logs when not being read from the serial.

@Grovkillen
Copy link
Member

I disabled serial log on a unit that is sometimes hard to web connect to. The mem usage dropped a lot (free 12500 before, free 20500 after).

@TD-er
Copy link
Member

TD-er commented Feb 16, 2018

@ymoona Yes, this pull request :)

psy0rz pushed a commit that referenced this issue Feb 16, 2018
* [issue #869] Added 'LWT' to last will topic and improved CPU load

See #869 for discussion on Last Will Topic.
Also changed the way it tried to reconnect to make it return a lot faster when connection is not (yet) possible and call the PubSubClient::loop() at a much slower pace to reduce CPU usage. (See #847)
This higher CPU load was probably introduced when fixing #683.

* [MQTT] Fix error reporting success status with longer payloads

Applied PR https://github.com/knolleary/pubsubclient/pull/360/files

* made MQTT_CALLBACK_SIGNATURE for esp32 functional

Applied PR knolleary/pubsubclient#336
TD-er added a commit to TD-er/ESPEasy that referenced this issue Feb 17, 2018
When OpenHAB MQTT controller was selected, the web interface becomes unresponsive when the controller tab was selected.
Also the ESP almost freezes when MQTT messages were sent.

This is an attempt to make them work faster.
Also a lot of string manipulation, copies and resizes are now no longer needed and the received values are checked to be valid int or float before used.
psy0rz pushed a commit that referenced this issue Feb 17, 2018
* [issue #847 ] OpenHAB MQTT rewrite topic parser

When OpenHAB MQTT controller was selected, the web interface becomes unresponsive when the controller tab was selected.
Also the ESP almost freezes when MQTT messages were sent.

This is an attempt to make them work faster.
Also a lot of string manipulation, copies and resizes are now no longer needed and the received values are checked to be valid int or float before used.

* signed unsigned warning
@ymoona
Copy link
Author

ymoona commented Feb 17, 2018

I have tested with build: Release v2.0-20180217
Results:
The connection is now less stable with serial disabled. When I enable serial port it becomes more stable.
On ping I still loose connection quite often and the webserver is not always responding to requests. But it no longers becomes completely unreachable.
Reaction to MQTT messages is prompt even when the pings timeout. Also state is pushed back quite prompt, MQTT now has higher prio?

@TD-er
Copy link
Member

TD-er commented Feb 17, 2018

Could you please look at the logs (or via MQTT client) to see if MQTT looses connection to the broker?
What do you mean with "the connection"? Is it WiFi connection, or response to the web pages?

Ping is a bit tricky to use as a monitoring tool, since reply to ping is optional. It may be omitted when the server is too busy. (where the definition of "too busy" is quite vague)

The MQTT doesn't have higher prio, but it is just scheduled differently now.
You could tweak a little with the timer settings in here:

ESPEasy/src/ESPEasy.ino

Lines 1292 to 1304 in 5c4defd

if (timeOutReached(timermqtt)) {
// MQTT_KEEPALIVE = 15 seconds.
timermqtt = millis() + 250;
//dont do this in backgroundtasks(), otherwise causes crashes. (https://github.com/letscontrolit/ESPEasy/issues/683)
int enabledMqttController = firstEnabledMQTTController();
if (enabledMqttController >= 0) {
if (!MQTTclient.loop()) {
if (!MQTTCheck(enabledMqttController)) {
// Check failed, no need to retry it immediately.
timermqtt = millis() + 500;
}
}
}

The "250" was just a wild guess on my side, since it is hard to predict what effects it will have.
You may want to experiment with it between say 1 msec and 1 sec, to see what happens.

The 500 value was already there, but used for retry when the broker could not be reached.

Also try to look at the uptime. Just to make sure it doesn't reboot.

@ymoona
Copy link
Author

ymoona commented Feb 17, 2018

With connection I indeed mean ping or webserver response.
I will try with different values.

@ymoona
Copy link
Author

ymoona commented Feb 17, 2018

I have done some tests
value:50 - Ping response and webserver response is good.
MQTT response seems a bit slower.
Not sure how to interpret the load, sometimes it is at: 60% (LC=8525) but is also dropped to 1%.
It seems to go up when I send MQTT messages.

Have not tested other values yet.
Will also keep an eye on the uptime

@TD-er
Copy link
Member

TD-er commented Feb 17, 2018

There are some other delays introduced recently.
You can look in the code for WiFiConnected This is a function with a maximum timeout.
If there is no problem with the WiFi connection, it should not impose a delay at all.
You could try to change in the code of that function to set the minimum delay to 1 msec and set the divider a bit higher (now set to 10).
If that does change it for you, you could also try to lower the values when calling the function (easier to change it in that function than to replace it for all calls)
Perhaps a good calling value is 10 and minimum delay of 1 msec.

@yondom
Copy link

yondom commented Feb 17, 2018

@

@TD-er
Copy link
Member

TD-er commented Feb 17, 2018

Hmm that reply may need some explanation ;)

@ymoona
Copy link
Author

ymoona commented Feb 18, 2018

It seems there is no wifi connection problem as changes to the WiFiConnected function dont make a difference.
Values of 10 and below make the esp not connect to the AP anymore.

@TD-er
Copy link
Member

TD-er commented Feb 18, 2018

Which values do you mean? The update interval to process the MQTT? (the "250 ms" value)

Edit:
Ah, I understand, the delay in the WiFiConnected() function.

@ymoona
Copy link
Author

ymoona commented Feb 18, 2018

No I meant the this part:
timermqtt = millis() + 10;

@TD-er
Copy link
Member

TD-er commented Feb 18, 2018

OK, so if that's too low, it will 'choke' the ESP on requests?

Funny part is, that its original implementation (at the start of this thread) was way shorter interval.
The call to the PubSubClient::loop() function was done every call of the main loop (the LC value used to compute the CPU-load)
I guess that value represents the number of loops per 30sec, so it was about 15k - 20k per 30 sec.
That's over 500 times per second. (similar to 2 msec delay)

Now does the call to the PubSubClient::loop() function do a bit more, but still it is currently a lot less than it used to be.
Also the call to make a reconnect is now less blocking and I guess I can do even better now that we are using somewhat of a scheduling system.
For example starting a reconnect can be split into several stages after which the next cycle is performed and upon return evaluate the reply, do the next stage, etc.
That will make it even perform better for any MQTT controller.

I just have to think more in streaming mode and do baby steps each run of the main loop.

TD-er added a commit to TD-er/ESPEasy that referenced this issue Feb 19, 2018
…B MQTT

Thanks [jivkopavlovski](letscontrolit@f23d9fa#commitcomment-27660279) for pointing this out (and testing ofcourse)
TD-er added a commit that referenced this issue Feb 19, 2018
[issue #847] Fix last letter of command is cut in OpenHAB MQTT
TD-er added a commit to TD-er/ESPEasy that referenced this issue Feb 25, 2018
See letscontrolit#922 and letscontrolit#847

Problem with DNS resolve is that it is quite slow when resulting in a failure and it tends to keep the ESP occupied and make it unresponsive.

Changed the way it is connecting to a controller.
Not sure if http requests are still OK, maybe we should make an exception for those when selected to use DNS.
psy0rz pushed a commit that referenced this issue Feb 25, 2018
See #922 and #847

Problem with DNS resolve is that it is quite slow when resulting in a failure and it tends to keep the ESP occupied and make it unresponsive.

Changed the way it is connecting to a controller.
Not sure if http requests are still OK, maybe we should make an exception for those when selected to use DNS.
@TD-er
Copy link
Member

TD-er commented Feb 27, 2018

Has this issue been improved with the last few days of commits?

@TD-er TD-er added the Type: Bug Considered a bug label Mar 2, 2018
@TD-er TD-er added this to the 2.0.0 milestone Mar 2, 2018
@TD-er
Copy link
Member

TD-er commented Mar 23, 2018

Recently a lot has changed on how services will be started and connections made.

Is this still an issue?

@ymoona
Copy link
Author

ymoona commented Mar 23, 2018

I have flashed them with tasmota and they are not accessible anymore as they built in. So I cannot thest this anymore. Sorry

@TD-er
Copy link
Member

TD-er commented Mar 23, 2018

OK, understandable.
Should I close this issue?

@ymoona
Copy link
Author

ymoona commented Mar 23, 2018 via email

@TD-er TD-er closed this as completed Mar 23, 2018
@kozoke
Copy link

kozoke commented May 29, 2018

Hi - i'm trying to control a neopixel candle from Home Assistant. I've tried using MQTTBox to change the settings of the candle (type, brightness, color) but nothing appears to change. this bug appeared to be related. My espeasy install version is:
{"System":{
"Build":20102,
"Git Build":"mega-20180524",
"System libraries":"ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3",
Neopixel candle json:
{
"TaskValues": [
{"ValueNumber":1,
"Name":"color",
"NrDecimals":0,
"Value":1346780
},
{"ValueNumber":2,
"Name":"brightness",
"NrDecimals":0,
"Value":168
},
{"ValueNumber":3,
"Name":"type",
"NrDecimals":0,
"Value":7
}],
"DataAcquisition": [
{"Controller":1,
"IDX":0,
"Enabled":"true"
},
{"Controller":2,
"IDX":0,
"Enabled":"false"
},
{"Controller":3,
"IDX":0,
"Enabled":"false"
}],
"TaskInterval":10,
"Type":"Output - NeoPixel (Candle)",
"TaskName":"led",
"TaskEnabled":"true",
"TaskNumber":6
}
``

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Considered a bug
Projects
None yet
Development

No branches or pull requests

9 participants