Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious disconnect when Homie connects over TLS/SSL #46

Closed
jpmens opened this issue Mar 23, 2016 · 30 comments
Closed

Spurious disconnect when Homie connects over TLS/SSL #46

jpmens opened this issue Mar 23, 2016 · 30 comments
Labels
Milestone

Comments

@jpmens
Copy link

jpmens commented Mar 23, 2016

As first reported in #44 I'm seeing a disconnect when Homie connects via TLS (this does not occur in plain connections). Screenshot:

screenshot_2016-03-23_09-41-00

Server-side, the MQTT broker identifies an SSL handshake error, even after it sees the initial publishes from the device:

1458722417: OpenSSL Error: error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
1458722417: Socket error on client Homie-cf3a07e0, disconnecting.

It would appear to me there's something corrupt going over the wire.

After the second connection succeeds, the Homie device carries on happily publishing for hours on end. This behaviour is reproducible.

@jpmens
Copy link
Author

jpmens commented Mar 23, 2016

The following sketch is basically a copy of your LightOnOff example. You will notice that I added a second node called ping which doesn't do terribly much.

#include <Homie.h>

const int PIN_RELAY = 16;

HomieNode lightNode("light", "switch");
HomieNode pingNode("ping", "pong");

bool lightOnHandler(String message) {
  if (message == "true") {
    digitalWrite(PIN_RELAY, HIGH);
    Homie.setNodeProperty(lightNode, "on", "true"); // Update the state of the light
    Serial.println("Light is on");
  } else if (message == "false") {
    digitalWrite(PIN_RELAY, LOW);
    Homie.setNodeProperty(lightNode, "on", "false");
    Serial.println("Light is off");
  } else {
    return false;
  }

  return true;
}

bool pingHandler(String message) {
  Serial.println("ping here");
  Homie.setNodeProperty(pingNode, "pong", "Bonjour!");
  return true;
}

void setup() {
  pinMode(PIN_RELAY, OUTPUT);
  digitalWrite(PIN_RELAY, LOW);

  Homie.setFirmware("sub2-relay", "1.0.0");
  lightNode.subscribe("on", lightOnHandler);
  pingNode.subscribe("on", pingHandler);
  Homie.registerNode(lightNode);
  Homie.registerNode(pingNode);
  Homie.setup();
}

void loop() {
  Homie.loop();
}

When run over a plain MQTT connection it works. When running over an TLS connection the device disconnects and reconnects randomly and seemingly doesn't react at all (neither to light nor to ping).

I'm hoping this can help you diagnose the issue.

@marvinroger
Copy link
Member

The issue happening obviously because of the heap, I'll try to remove most of the dynamic allocations and instead allocate on the stack. The good thing is I am learning a lot resolving this issue!

@marvinroger
Copy link
Member

Could you please try again? It seems stable on my side after 2be5da0

@jpmens
Copy link
Author

jpmens commented Mar 23, 2016

I'm sorry to disappoint you, but no change here. In plain mode all works fine as far as I can tell. In TLS mode, the moment I try to talk to the device (e.g. relay on), it disconnects.

@jpmens
Copy link
Author

jpmens commented Mar 24, 2016

So, testing with the above sketch on TLS again, this time a bit more carefully. The screenshot shows what happens initially:

screenshot_2016-03-24_07-38-14

In fact, I don't even have to try and switch "light on"; the device disconnects by itself quite periodically, whereupon the broker logs

1458801612: Sending SUBACK to Homie-cf3a07e0
1458801612: OpenSSL Error: error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
1458801612: Socket error on client Homie-cf3a07e0, disconnecting.

The first couple minutes of the socket errors seen on the broker in human readable times show this happens about every 5 seconds:

1458801612 Thu Mar 24 07:40:12 2016
1458801617 Thu Mar 24 07:40:17 2016
1458801623 Thu Mar 24 07:40:23 2016
1458801673 Thu Mar 24 07:41:13 2016
1458801703 Thu Mar 24 07:41:43 2016
1458801704 Thu Mar 24 07:41:44 2016
1458801709 Thu Mar 24 07:41:49 2016
1458801714 Thu Mar 24 07:41:54 2016
1458801719 Thu Mar 24 07:41:59 2016

@jpmens
Copy link
Author

jpmens commented Mar 24, 2016

If I comment out the two lines in above sketch:

// pingNode.subscribe("on", pingHandler);
// Homie.registerNode(pingNode);

the device is much more stable: after the initial disconnect it stays up and I can talk to it.

@marvinroger
Copy link
Member

There are still some dynamic allocations done:

  • The registered nodes
  • The MQTT topic buffer
  • The config file buffer
  • The strings in the config

A solution would be to completely avoid these dynamic allocations, but it would mean setting hard limits in the code:

  • Limit the max length of a node id / type
  • Limit the number of subscriptions per node
  • Limit the number of nodes
  • Limit the max friendly $name of a device

... etc.

If this does not solve the issue, then the problem is that the framework is oversized for the ESP8266

@jpmens
Copy link
Author

jpmens commented Mar 24, 2016

I think it's a case of memory being cloberred somewhere ...

When you said Limit the max friendly $name of a device, I thought "Hey!", and changed $name from Livingroom bookshelf to N1. What happens now, is

1458815935: New connection from 192.168.1.193 on port 9993.

followed by this:

screenshot_2016-03-24_11-42-01

though not readily reproducible.

@jpmens
Copy link
Author

jpmens commented Mar 24, 2016

I've been going backwards in largish steps to commit f6ee49a to see if this has anything to do with me showing up on the scene here. It doesn't. 😄

All the time we've spoken so far, I was testing with a Wemos-D1 mini. I didn't really expect it would make a difference, but the NodeMCU (Amica) behaves identically: the above sketch just won't behave properly.

@marvinroger
Copy link
Member

To be honest, when I implemented SSL I never tested it. SSL encryption takes up more than a quarter of the available heap, so that does not surprise me.

@jpmens
Copy link
Author

jpmens commented Mar 24, 2016

Does the following information help at all? Of course I realize I'm comparing apples and potatoes; I'm just trying to be helpful...

Screenshot of memory requirements for sub2.ino (the small sketch above):

screenshot_2016-03-24_11-54-47

Screenshot of memory requirements for a simple MQTT over TLS using Imroy/pubsubclient:

screenshot_2016-03-24_16-37-01

@marvinroger
Copy link
Member

And you are helpful! But this piece of information shows stack usage, not heap, which is our problem here. I am working on moving everything to the stack in a clean way. The problem is this will be a breaking change.

@jpmens
Copy link
Author

jpmens commented Mar 24, 2016

Well, in my possibly not very humble opinion, you should go ahead and break it so that it's fixed! 😄

@marvinroger
Copy link
Member

You're right!

@marvinroger
Copy link
Member

It's not a breaking change, so good news, and the problematic sketch seems to work. Could you please test?

@marvinroger
Copy link
Member

By the way, maybe fingerprint is fixed too?

@jpmens
Copy link
Author

jpmens commented Mar 26, 2016

The good news is, it is much better, the bad news is, it is not good. I have the following observations:

  1. My small test sketch (sub2.ino as shown above) actually works, and I was able to "ping" it.
  2. Fingerprint checking doesn't work: it's actually completely ignored, i.e. wether I use correct or incorrect fingerprint in configuration makes no difference. (I haven't been able to look at your code though)
  3. It's touch and go wether sketches work. I was pleased to see that my slightly larger sketch (4 nodes) worked and I could switch a relay, but then it died. Whether or not the sketch stays alive is a bit of touch and go.

Looking slightly closer, I notice that you've changed the way subscriptions work. Contrary to previously where individual subscribes were issued, I now see a single wild-card subscription:

1459003383: Received SUBSCRIBE from Homie-cf3a07e0
1459003383:     sensors/homie/cf3a07e0/# (QoS 1)

I think Homie is literally killing itself! My previous experience with PubSubClient on Arduino shows that if I send out a few messages "en block" it just dies. In this particular case, Homie is sending out several publishes, eg. during setupHandler, and they're being bounced back to the device; I think it then just dies.

I will try, time permitting, to observe this a little closer, but I believe that ought to be the reason...

Be that as it may, congrats so far!

@jpmens
Copy link
Author

jpmens commented Mar 26, 2016

What would probably help a bit to mitigate the issue is to narrow-down the wildcard to something like this:

sensors/homie/cf3a07e0/+/+/set

This would prevent most publishes from a Homie device being sent back to the same Homie device. I'm 3/4 or 7/8 convinced the device is hanging itself up due, not to your code, but to PubSubClient.

@marvinroger
Copy link
Member

You're right, the wildcard subscription change is due to an issue with PubSubClient when subscribing to 5 topics or more. Subscribing to only /set messages, /$ota and /$reset would reduce the load. I'll try next week, thanks for the great suggestion!

@jpmens
Copy link
Author

jpmens commented Mar 28, 2016

I've had a Homie device running since my last comment (approx 24h) on TLS, and it's still publishing three messages every 10s:

sensors/homie/cf3a07e0/temperature/degrees 22.50
sensors/homie/cf3a07e0/temperature/freeheap 11504
sensors/homie/cf3a07e0/ping/epoch 1459150716

I do note, however, that it periodically disconnects and then reconnects. The broker logs

1459150544: OpenSSL Error: error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
1459150544: Socket error on client Homie-cf3a07e0, disconnecting.

@marvinroger
Copy link
Member

I can't optimize things more than this, if it's still an issue, it has to be solved in PubSubClient or in the Arduino for ESP8266 SSL implementation. 😕

@marvinroger marvinroger added this to the 1.3.0 milestone Mar 28, 2016
@marvinroger
Copy link
Member

I have an idea. By default, the ESP runs at 80MHz. I have seen projects running at 160MHz when using SSL. With twice the power of calculation, maybe the handshake failure will disappear, and the touch and go you're describing might disappear... What do you think?

@jpmens
Copy link
Author

jpmens commented Mar 28, 2016

As long as the chip then doesn't burn a hole through my desk, I'd say "let's try it"! ;-)

@marvinroger
Copy link
Member

Could you please try your "heavy" sketch?

@jpmens
Copy link
Author

jpmens commented Mar 29, 2016

One word: SHIP IT! (Ok, those were two words, but you get the drift.)

  1. Homie hasn't been able to kill itself with pubs/subs yet; it survived many resets, which it didn't do previously.
  2. My heavy sketch has been running sans disconnects since 08:16 this morning. (I'll continue to monitor for 24h)
  3. Fingerprint checking works! (I've tested positive and negative.)

You've done a tremendous job, thank you.

I would like to recommend adding something like this to the Wiki:

Fingerprint

Adding a TLS fingerprint effectively pins the device to a particular certificate. Furthermore, as currently implemented by WifiSecureClient(), both mqtt.host and ota.host are verified against the server certificate's common name (CN) in the certificate subject or in the SANs (subjectAlternateName) contained in it, but not in their IP addresses. So, if the certificate used by your server looks like

Subject: CN=tiggr.example.org, OU=generate-CA/emailAddress=nobody@example.net
...
X509v3 Subject Alternative Name:
  IP Address:192.168.1.10, DNS:broker.example.org

enabling fingerprint in Homie will work only if host is set to tiggr.example.org or broker.example.org and the correct fingerprint is used; setting host to the IP address will cause fingerprint verification to fail.

@marvinroger
Copy link
Member

Great! We made it, thanks a lot for your tests, it helped so much. I also added a note on the docs. Note that the Wiki is no more used, because the Wiki was up-to-date with the latest git rev, and so people using an older version were not able to have instructions related to their version. So now docs are embedded into the repo. 😉

@marvinroger
Copy link
Member

You said you were monitoring for 24h, is it still alright?

@jpmens
Copy link
Author

jpmens commented Mar 30, 2016

I didn't want to bother you with it ... ;-) Yes, it's basically alright. I've had the following reconnects during the past 24h, but much better than it was:

2016-03-29 19:43:47.701872
2016-03-29 21:33:58.774420
2016-03-30 11:43:53.941828

So, I'd say it's fine.

@marvinroger
Copy link
Member

Awesome! Maybe the 2.2.0 release of the Arduino for ESP8266 will fix things even more, but I can't do more in Homie. Can we close?

@jpmens
Copy link
Author

jpmens commented Mar 30, 2016

Fermé! ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants