Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External reconnect fails as long as "old" connection has not been terminated gracefully #843

Closed
n0l0cale opened this issue May 22, 2020 · 16 comments
Labels
⚓ upstream issue This is a bug/issue for/in upstream software (OCCU, etc.) 🐛 bug-report Something isn't working 🏷️ HmIPServer This refs the HmIPServer component 👍 important This is an important issue/ticket with high priority

Comments

@n0l0cale
Copy link

Describe the bug
RaspberryMatic seems not accept new connections, as long as the old one have not been terminated.

To Reproduce
Steps to reproduce the behavior:

  1. connect RaspberryMatic to e.g. Home Assistant
  2. configure some sensors to be monitored by your control software
  3. restart your RaspberryMatic Software

Expected behavior
A reconnect should be successful.

System information (please complete the following information):

  • Version RaspberryMatic 3.51.6.20200420
  • Hardware RaspberryPi3 B+
  • for reference: Homematic Version 0.110.1

Additional context
This issue has also been discussed in https://community.home-assistant.io/t/homematic-reconnect-does-not-work/161853/18

@danielperna84
Copy link

To clarify, this is about XML-RPC. But only for HmIP. There are no problems with the old wired and wireless protocols.
That being said, I don't believe RaspberryMatic can do anything about this.

@n0l0cale
Copy link
Author

thought maybe we can come a step further when more guys are involved.....

@jens-maus
Copy link
Owner

@n0l0cale Please provide adequate example XMLRPC communication output between home assistent and the CCU/RaspberryMatic/HmIPServer where the issue is demonstrated. Only with detailed information you will have any chance that this issue will get fixed by eQ3.

@jens-maus jens-maus added ⚓ upstream issue This is a bug/issue for/in upstream software (OCCU, etc.) 🐛 bug-report Something isn't working 🕙 awaiting feedback waiting for feedback from ticket owner ❓ undecided No decision to accept or reject ticket yet 🏷️ HmIPServer This refs the HmIPServer component labels May 22, 2020
@n0l0cale
Copy link
Author

n0l0cale commented May 22, 2020

@danielperna84 - could you please give me assistance with providing such a sample? I think you are deeper in this than I am....

@no-response no-response bot removed the 🕙 awaiting feedback waiting for feedback from ticket owner label May 22, 2020
@danielperna84
Copy link

@n0l0cale
I'll see if I can come up with a script that continuously reproduces the error. Maybe over the weekend.

@danielperna84
Copy link

danielperna84 commented May 22, 2020

Ok, so I have created the following test-script: https://gist.github.com/danielperna84/28653b1ebf760064cdbecf5f4828f5f3

Adjust the variables at the top (the IP addresses mainly), then execute it with python3 hmip-xml-rpc-bug.py. Don't forget to allow the incoming traffic on the specified port.

On the CCU / RaspberryMatic you'll get something the following when the second init request has been sent:

==> hmserver.log <==
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-2] (un)registerCallback on LegacyServiceHandler called from url: http://192.168.1.30:32001
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-0] init finished
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer INFO  [vert.x-worker-thread-3] Added InterfaceId: testclient
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer ERROR [vert.x-worker-thread-3] Serialization Exception: Could not add interface: testclient
de.eq3.cbcs.legacy.communication.rpc.RpcSerializationException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:173)
        at de.eq3.cbcs.legacy.communication.rpc.internal.transport.http.HttpTransport.sendRequest(HttpTransport.java:107)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.sendRequest(RpcClient.java:94)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.invoke(RpcClient.java:82)
        at com.sun.proxy.$Proxy22.listDevices(Unknown Source)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.LegacyBackendClient.listDevices(LegacyBackendClient.java:139)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.DeviceUtil.synchronizedBackendDevices(DeviceUtil.java:144)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:109)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:26)
        at io.vertx.core.eventbus.impl.HandlerRegistration.deliver(HandlerRegistration.java:212)
        at io.vertx.core.eventbus.impl.HandlerRegistration.handle(HandlerRegistration.java:191)
        at io.vertx.core.eventbus.impl.EventBusImpl.lambda$deliverToHandler$3(EventBusImpl.java:505)
        at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
        at io.vertx.core.impl.TaskQueue.lambda$new$0(TaskQueue.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.nextTag(XMLStreamReaderImpl.java:1257)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseObject(XmlRpcParser.java:437)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:394)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseArray(XmlRpcParser.java:521)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:387)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseMethodResponse(XmlRpcParser.java:164)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:171)
        ... 16 more

This only happens for HmIP. Doing the same with port 2001 shows no errors in the hmserver.log (although I don't get ANY logs for 2001).

At the bottom of the script the wait times could be increased. What will be observed is, that after the first init we receive events from the devices (if we wait long enough for events to happen obviously). After the second init we don't receive any events anymore. I assume this is a result of the exception.

That being said, the reason for this behavior seems to be, that we don't just return an empty list to the CCU after the init - it already has the device descriptions from the first init. If a clearDevices() is added after the first time.sleep(5), we don't get any errors on the CCU during the second init. This fits the observation in Home Assistant, which needs to be restarted to receive events again. The restart of Home Assistant clears the local devices, which is why upon the next startup it will be in the state like this script is during the first init.

So to sum this up: the HMIPServer on the CCU fails to accept init-requests if the listDevices-call it makes to our server returns more than just an empty list. With the old RF and wired devices this works without any issues.
From my point of view this definitely is a bug, because there would be no point in asking the remote server for the known devices if the presence of devices leads to a failure. On top of that, the CCU should not require the remote-end to clear its knowledge about present devices and set them up all the way from the beginning.

As far as I am aware of, the XML-RPC API documentation does not state, that the format of the response to the listDevices call has to match a specific format. Hence copying what the CCU itself returns when calling listDevices (that's essentially what the script is doing) should not lead to such an error. And again, this works for the old wired and RF interfaces. So for HmIP it should either behave the same way, or the limitations need to be communicated clearly.

Edit:
One more detail I found while looking at this on RaspberryMatic with strace is, that the response of our server always gets cut to 8192 bytes. The (redacted) trace (done on a different machine) looks like this:

...
837   send(101, "POST / HTTP/1.1\r\nContent-Length: 158\r\nHost: 192.168.1.142:43043\r\nConnection: Keep-Alive\r\nUser-Agent: Apache-HttpClient/4.5.2 (Java/1.8.0_232)\r\n\r\n
<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><methodCall><methodName>listDevices</methodName><params><param><value>my-hmip</value></param></params></methodCall>", 303, 0) = 303
837   poll([{fd=101, events=POLLIN|POLLERR}], 1, 600000) = 1 ([{fd=101, revents=POLLIN}])
837   recv(101, "HTTP/1.0 200 OK\r\nServer: BaseHTTP/0.6 Python/3.7.3\r\nDate: Wed, 29 Apr 2020 22:04:01 GMT\r\nContent-type: text/xml\r\nContent-length: 145462\r\n\r\n
<?xml version='1.0'?>\n<methodResponse>\n<params>\n<param>\n<value><array><data>\n<value><struct>\n<member>\n<name>TYPE</name>\n<value><string>HMIP-PSM</string></value>...............<string>SERVICE</string></value>\n</data></array></value>\n</member>\n<member>\n<name>FIRMWARE</name>\n<value><string>2.6.2</string></value>\n<", 8192, MSG_DONTWAIT) = 1599
837   dup2(55, 101)                     = 101
837   close(101)                        = 0
837   write(2, "de.eq3.cbcs.legacy.communication.rpc.RpcSerializationException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]\nMessage: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT", 198) = 198
...

Obviously the XML is bad since it ends abruptly. So the exception makes sense. The question is why the CCU only receives the first 8192 bytes. I also traced the other end, and the data that gets sent as a response is not truncated. This leads me to believe, that on the CCU there seems to be some buffer-limitation for incoming data.

@n0l0cale
Copy link
Author

n0l0cale commented Jun 3, 2020

@jens-maus hi man - should I support you in writing a bug report to the upstream project? Do you have any contact details or do you want to do it by your own? Do you need maybe any further information?

@jens-maus
Copy link
Owner

@n0l0cale Nope. here is the correct place for this kind of bug report on HmIPServer because eq3 ist more or less listening/reading here when I point them at this ticket (will do). So thanks for the infos here. As soon as I had time to reproduce it I will flag it as reproduced and then its' eq3 task to investigate+fix.

@jens-maus
Copy link
Owner

jens-maus commented Jun 3, 2020

@danielperna84 I finally tried to reproduce the issue using your hmip-xml-rpc-bug.py script and here are my results:

On the CCU / RaspberryMatic you'll get something the following when the second init request has been sent:

==> hmserver.log <==
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-2] (un)registerCallback on LegacyServiceHandler called from url: http://192.168.1.30:32001
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-0] init finished
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer INFO  [vert.x-worker-thread-3] Added InterfaceId: testclient
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer ERROR [vert.x-worker-thread-3] Serialization Exception: Could not add interface: testclient
de.eq3.cbcs.legacy.communication.rpc.RpcSerializationException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:173)
        at de.eq3.cbcs.legacy.communication.rpc.internal.transport.http.HttpTransport.sendRequest(HttpTransport.java:107)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.sendRequest(RpcClient.java:94)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.invoke(RpcClient.java:82)
        at com.sun.proxy.$Proxy22.listDevices(Unknown Source)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.LegacyBackendClient.listDevices(LegacyBackendClient.java:139)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.DeviceUtil.synchronizedBackendDevices(DeviceUtil.java:144)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:109)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:26)
        at io.vertx.core.eventbus.impl.HandlerRegistration.deliver(HandlerRegistration.java:212)
        at io.vertx.core.eventbus.impl.HandlerRegistration.handle(HandlerRegistration.java:191)
        at io.vertx.core.eventbus.impl.EventBusImpl.lambda$deliverToHandler$3(EventBusImpl.java:505)
        at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
        at io.vertx.core.impl.TaskQueue.lambda$new$0(TaskQueue.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.nextTag(XMLStreamReaderImpl.java:1257)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseObject(XmlRpcParser.java:437)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:394)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseArray(XmlRpcParser.java:521)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:387)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseMethodResponse(XmlRpcParser.java:164)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:171)
        ... 16 more

I was able to reproduce these error outputs in hmserver.log as soon as an listDevices() answer is sent to HmIPServer with a non-empty device list. So yes, there indeed seems to be an issue in dealing with receiving a list of devices with a listDevices() rpc execution in HmIPServer.

This only happens for HmIP. Doing the same with port 2001 shows no errors in the hmserver.log (although I don't get ANY logs for 2001).

Please note that the BidCos-RF service on Port 2001 is not provided by HmIPServer but the rfd service which in fact in a C++ tool and output its debug/error output directly to syslog, hence you have to look into /var/log/messages for potential error messages for the rfd/BidCos-RF service on Port 2001.

At the bottom of the script the wait times could be increased. What will be observed is, that after the first init we receive events from the devices (if we wait long enough for events to happen obviously). After the second init we don't receive any events anymore. I assume this is a result of the exception.

This I could not reproduce here. In fact, even after the second init() I perfectly receive events, at least for the virtual devices I could still trigger in the WebUI.

That being said, the reason for this behavior seems to be, that we don't just return an empty list to the CCU after the init - it already has the device descriptions from the first init. If a clearDevices() is added after the first time.sleep(5), we don't get any errors on the CCU during the second init. This fits the observation in Home Assistant, which needs to be restarted to receive events again. The restart of Home Assistant clears the local devices, which is why upon the next startup it will be in the state like this script is during the first init.

Can you please provide example output of your script showing the exact payload of the error? Would really help if one can see that after the second init() no events are received anymore. As said, I can perfectly use the WebUI here and do a PRESS_SHORT or PRESS_LONG on the internal/virtual devices the WebUI provides and then I can see these events in the python script, even after the second init.

From my point of view this definitely is a bug, because there would be no point in asking the remote server for the known devices if the presence of devices leads to a failure. On top of that, the CCU should not require the remote-end to clear its knowledge about present devices and set them up all the way from the beginning.

I fully agree with you that it should be perfectly able to directly provide a device list in the listDevices() answer and the XMLRPC service of HmIPServer should then just diff the provided devices and just call newDevices() with the devices missing. However, please note that the XMLRPC Service in HmIPServer is unfortunately quite legacy and has some shortcomings/bugs. But I will try to get eQ3 informed and see that this issue will be resolved. As a workaround (until the issue is fixed), I would suggest that you change the initialization in home assistent that no device list is provided in the listDevices() execution for the HmIP service query and that you do the diffing (old vs new devices) on your own. AFAIK ioBroker and other third-party services are doing the same and do not rely on the XMLRPC service for doing the listDevices() based diffing as one would expect. See here, for example:

https://github.com/ioBroker/ioBroker.hm-rpc/blob/master/hm-rpc.js#L1013

There, ioBroker.hm-rpc is doing the same for the HmIP service and I think for the exact same reason.

As far as I am aware of, the XML-RPC API documentation does not state, that the format of the response to the listDevices call has to match a specific format. Hence copying what the CCU itself returns when calling listDevices (that's essentially what the script is doing) should not lead to such an error. And again, this works for the old wired and RF interfaces. So for HmIP it should either behave the same way, or the limitations need to be communicated clearly.

Nope, the format should be perfectly fine as you provided it, at least AFAICS.

One more detail I found while looking at this on RaspberryMatic with strace is, that the response of our server always gets cut to 8192 bytes. The (redacted) trace (done on a different machine) looks like this:

...
837   recv(101, "HTTP/1.0 200 OK\r\nServer: BaseHTTP/0.6 Python/3.7.3\r\nDate: Wed, 29 Apr 2020 22:04:01 GMT\r\nContent-type: text/xml\r\nContent-length: 145462\r\n\r\n
<?xml version='1.0'?>\n<methodResponse>\n<params>\n<param>\n<value><array><data>\n<value><struct>\n<member>\n<name>TYPE</name>\n<value><string>HMIP-PSM</string></value>...............<string>SERVICE</string></value>\n</data></array></value>\n</member>\n<member>\n<name>FIRMWARE</name>\n<value><string>2.6.2</string></value>\n<", 8192, MSG_DONTWAIT) = 1599
...

Obviously the XML is bad since it ends abruptly. So the exception makes sense. The question is why the CCU only receives the first 8192 bytes. I also traced the other end, and the data that gets sent as a response is not truncated. This leads me to believe, that on the CCU there seems to be some buffer-limitation for incoming data.

Thanks for that one. This indeed looks like there is a buffer limiting factor here that could be one of the main reasons why using listDevices() with the HmIPServer XMLRPC service is not working correctly ATM. I will forward these observations to my eQ3 contacts and hope they will fix it in the next release.

@jens-maus jens-maus added 👍 important This is an important issue/ticket with high priority and removed ❓ undecided No decision to accept or reject ticket yet labels Jun 3, 2020
@danielperna84
Copy link

This I could not reproduce here. In fact, even after the second init() I perfectly receive events, at least for the virtual devices I could still trigger in the WebUI.

I have to correct myself. I have mixed up different scenarios in my head while doing all this. The initial problem how all this got started was, that when the CCU gets rebooted (and thus requires a new init before it starts sending events), we use an automation to reconnect (in the backgorund it's a re-init to be precise, but reconnect sounds more straightforward for users) once the CCU is back up again.
In this state the CCU obviously won't push events. And Home Assistant already knows about the devices, which it why it responds with the filled array during the re-init. So I was also doing those reboot-tests with my script, and there I got the observed result. Without the reboot the events keep coming. So what you've experienced is the expected behavior.

I agree, that following your proposal of just returning an empty array would be a temporary fix. But it would come with some great drawbacks caused by the current architecture of the code. Significant changes would need to be made just to work around the bug. So if there is a chance of this being fixed by eq3, I would favor not to take the chance to introduce bugs or other problems in our code. Waiting a bit seems like the better solution in this case. After all, this bug has been present for so long, that a few more weeks / months won't really matter in my opinion. 😉

@jens-maus
Copy link
Owner

@danielperna84 I just received notification that the above mentiond issue in HmIPServer should be fixed in the next upcoming CCU/OCCU firmware release. The root cause of the issue have been, however, that the internal xmlrpc methods couldn't work well with empty xml tags and caused the above mentioned java exceptions. Thus, a xmlrpc method response like the following one already triggered the issue when sent to HmIPServer:

<?xml version="1.0" ?><methodResponse><params><param><value><string></string></value></param></params></methodResponse>

Please note the empty <string></string> tag that actually triggered the issue internally in HmIPServer. So to workaround the issue for older HmIPServer versions please make sure to not send over empty <string> tags or make sure the tag always contains some text to parse.

@jens-maus jens-maus added this to the future release milestone Jun 8, 2020
@danielperna84
Copy link

Ok, so this essentially is the same as #489. Thanks for your help. 👍

@hobbyquaker
Copy link
Contributor

eq-3/occu#83

@jens-maus
Copy link
Owner

This issue should be fixed since a few versions successfully. Please retry and reopen if still problematic.

@n0l0cale
Copy link
Author

Hi all,

I've been testing this with every version - latest 3.55.10.20210213 included - and this issue seems not to be fixed. Can someone else confirm this?

n0l0cale

@f0rdprefect
Copy link

Looks like I just got hit by this. I had one way connection to homematic devices, but both ip and classic where I could set states but would not get any status back.

As usual when strange things happen I tried to restart. First the raspberrymatic (3.57.5.20210525) Then HA. Now everything is back to bidirectional communication.

In the homematic alarm messages I could see a warning about a lost internet connection. After all this I started looking and found this issue. So I guess it should be reopened?!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚓ upstream issue This is a bug/issue for/in upstream software (OCCU, etc.) 🐛 bug-report Something isn't working 🏷️ HmIPServer This refs the HmIPServer component 👍 important This is an important issue/ticket with high priority
Projects
None yet
Development

No branches or pull requests

5 participants