Telegesis: Devices becoming unresponsive #57
After some time, devices won't respond to commands anymore. The corresponding things are still online in OH but stop working. Judging from the comments in the community thread, this problem seems to affect other users as well.
Is it possible this is caused by the growing TX queue also discussed here?
Sending a command does nothing:
And the tx queue is just growing and growing:
Here is a full debug log from OH startup up to a queue count of 8, maybe that helps. Queue seems to start growing around 2017-12-10 14:55:06: zigbee_tx_queue_growing.log.zip
Comments
Thanks - yes, it’s caused by the TX issue we discussed last week. I’ll take a look at resolving this, but unfortunately it will be a few weeks until I do, as I’m off on holiday from tomorrow.
No problem, I just wanted to provide the required information in a separate issue since it's not really related to the state update problem. Enjoy your vacation!
No probs - thanks.
Hi Chris, I've spent some time trying to narrow down my issue and believe it comes down to the same problem as this one. I've been able to pinpoint the moment that it occurs too:
I had to leave one character out of that unknown input string; pasting it completely stopped the rest of the paste!? From then on, the TX queue just keeps on growing and none of my lights respond:
For a full log, which includes that character:
Sorry @BClark09, missed this - I'll take a look at this exception tomorrow. I'm not completely convinced it's the only reason for this though, but it's still one to solve, then we can see...
Note to self: This exception is 'ok' - it's handled fine. It's caused by a corrupted frame from the dongle - this is not unexpected since there is no error detection/control in the Telegesis protocol. Edit: the problem is that it is processed a second time as part of the transaction handler - the exception will kill the thread there.
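To sketch the failure mode (invented names, not the binding's actual code): if the exception escapes the transaction handler's processing loop, that thread dies silently and the TX queue stops draining. A guard like this keeps the thread alive:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of a handler loop that survives corrupt frames.
class TransactionHandlerSketch implements Runnable {
    private final BlockingQueue<byte[]> rxQueue = new LinkedBlockingQueue<>();
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            try {
                byte[] frame = rxQueue.take(); // blocks until the next frame arrives
                process(frame);                // may throw on a corrupt frame
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (RuntimeException e) {
                // Corrupt frames are expected (no error detection in the
                // Telegesis protocol), so log and carry on rather than let
                // the exception kill the thread.
                System.err.println("Ignoring corrupt frame: " + e);
            }
        }
    }

    private void process(byte[] frame) {
        // frame parsing / transaction matching would live here
    }
}
```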
I've just kicked off a build, so this should be included in the binding when that's complete (give it 30 minutes). @BClark09 the exception you saw is now fixed, so if this problem is solely related to that, then this issue might be resolved ;). Actually, the exception you logged was not an issue, but the same exception would also have been logged to the console, and it's that one that killed the thread. Ultimately these errors will continue to occur due to corrupted frames, but they are logged as debug messages so can be ignored. @weakfl Let me know how it looks in a few days...
Alright, thanks again @cdjackson! I'll see what occurs when I get home this evening. I've set up a new openHAB instance with just the zigbee binding on it, so it'll be easy to separate and pick apart logs from now on :)
Ok, cool - thanks. Any feedback is welcome :)
I got the same type of exception: It's definitely getting better. I left it running for a while, and eventually the lights would take about 27 seconds to respond. The logs at which I turn on the Kitchen_Light_Switch start at these lines:
The last RX command that is seen occurs at:
Hope this helps!
This is fine - as I said, they will continue. They are logged at debug level and are ultimately unavoidable since they're caused by corruption on the serial port. At least for the moment I prefer to leave them in the logs - maybe at some point I'll remove them. I don't think it's causing any sort of issue though, is it - it's handled, and it's ignored. If they are happening often, then it will likely cause some problems as it will result in timeouts, but there's no way around that as the frame is corrupt and there's no error detection in the Telegesis protocol. Even after the exception the responses are coming back fine....
I was, however, suspicious that the exceptions were not the cause of this issue, and I think your second log shows that - although I've not looked at it in detail yet.
Yep, that makes perfect sense. Just wanted to make sure that this is expected. Please let me know if I'm throwing lots of nothing at you. I was sure that I uploaded my own log as a file the first time, but as I look at my first post I somehow managed to reupload @weakfl's one? Sorry about that!
> Yep, that makes perfect sense. Just wanted to make sure that this is expected. Please let me know if I'm throwing lots of nothing at you.

It’s fine. I’ll try and look at this more in the next few days. I think there’s a thread sync issue that’s causing this, so it needs a bit of rework to check the way the TX/RX works.

> I was sure that I uploaded my own log the first time, but as I look at my first post I somehow managed to reupload @weakfl's one? Sorry about that!

Which log do you mean (just so I don’t spend time looking at incorrect logs ;) )
This one was a reupload, which I have now redacted:
Based on what you've said, I don't think the actual one I meant is of any use as it's all contained in the post. RXStop.log may have good information though.
I have had the latest version running since morning. The TX queue is still at 2 and devices are responsive. It's probably a bit early to tell, but things are looking good so far.
Ok, thanks. I've been looking over this again over the past hour and have added a timer to time out any transactions that don't complete - from looking at the logs, this might be another cause of the problem. I've also found I didn't catch all exceptions caused by corrupt frames, and this is what caused the issue in @BClark09's RXStop log. I'll post another version later tonight most likely.
#79 includes the Telegesis handler transaction timer and improves the exception trapping. The timer is set at 500ms - if the stick doesn't respond to a transaction in this time, it will be aborted. In my testing, I put a bodge in to randomly drop 1 in 4 frames and the binding continued to work ok (although there would probably be a load of other problems if the error rate was really this high!). If you start to see timeouts in the log, please let me know - I'm not 100% sure that 500ms will cover every transaction, but I think it should ;)
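Roughly the idea, as a simplified sketch with invented names (#79 has the real implementation): schedule an abort task when the transaction is sent, and cancel it if the response arrives first.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of a per-transaction timeout: if the stick doesn't respond within
// 500ms, the abort task runs and the transaction can be removed from the queue.
class TransactionTimeoutSketch {
    private static final long TIMEOUT_MS = 500;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> timeoutTask;

    void send(Runnable abortAction) {
        timeoutTask = timer.schedule(abortAction, TIMEOUT_MS, TimeUnit.MILLISECONDS);
        // ... write the frame to the serial port here ...
    }

    void onResponse() {
        if (timeoutTask != null) {
            timeoutTask.cancel(false); // response arrived in time, no abort needed
        }
    }
}
```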
Unfortunately the queue has grown to 1800 overnight. I’ll install the latest version now and report back tonight/tomorrow.
Queue is still growing:
Here's a log from the last couple of hours in case you need it.
I'm on 2.3.0~20180107193004-1 btw and I'm not sure if your latest changes have been merged yet? Is there a better way to check this than comparing build times?
In theory there should be a list of changes in cloudbees, but this doesn't seem to work. I've just triggered a new build...
Looking at your log, I don't think it includes the latest changes, but I also don't think the latest changes will fix this issue. It looks more like what I remember from your previous logs (possibly in the thread on the BJ switches), where I think the TX queue stops. My assumption is that there's a thread sync issue between the RX and TX threads, but I might double-check the exception trapping to make sure that no exception is slipping through the net...
That's quite possible. I'm installing from the 'openHAB 2 Unstable Repository' and not manually. I have no idea when the cloudbees builds are pushed to the package repository and haven't figured out how to tell which build the packages are actually based on :( I can't thank you enough for your continued efforts! Please let me know if there's anything I can do to help (provide logs or whatever)...
This one I can help with! The apt package is triggered automatically after the manual build completes, so whenever a snapshot is built, the unstable repo is updated within a couple of minutes. @cdjackson triggered build https://openhab.ci.cloudbees.com/job/openHAB-Distribution/1172/ and therefore the apt version corresponds to that build. That doesn't necessarily mean that your bundles would have updated automatically though. If you use
Growth has almost stalled, the queue has now reached 4 after almost 24 hours. Here's the log, but be warned, this one is huge :) zigbee_queue_20180110-20180111.log.zip <https://github.com/openhab/org.openhab.binding.zigbee/files/1621767/zigbee_queue_20180110-20180111.log.zip>
Thanks - just for clarity (given I’ve not looked at the log yet) - what do you mean by “growth has stalled”? Is that growth of the log, or that the queue is at 4 but has been there a long time, or something else?
Well, almost stalled :) Previously it would grow pretty fast; now it's growing slowly, but it's still growing. It goes up to 2, then varies, but never gets below 2. After a while it goes up to 3, varies again, but never goes below 3, and so on.
One more observation: at first I wasn't sure if it was just bad luck, but it has happened again now: the queue will start to grow pretty fast as soon as I set the log level from DEBUG to ERROR. I had noticed that behaviour yesterday when I turned debug logging off, and now it has happened again - the queue went up to 7 in about 30 minutes. Is it possible that the supposed thread sync issue has a bigger impact when debug logging is off? I'm not sure how to log this though???
From the binding’s perspective, there’s no change when you change logging, but I guess there will be a small change in processor loading which might make a difference. However, if the current issue is not a total stoppage of communications like we’ve seen before, but “just” a slow increase in the queue count, then I don’t think it will be a thread sync issue. A thread sync problem is more likely to lock up completely when one thread is waiting for a token that another has, and this would cause communications to stop totally (like we’ve seen before). Maybe those “complete stops” of comms I’ve seen in your logs before were actually caused by exceptions that weren’t logged, so we didn’t see them, but are now cleared up with the fixes over the past week. Previously these exceptions were only captured in the Karaf console so could very easily be missed.
Anyway, I’ll take a look at the log. Probably there is some sort of comms issue with the stick - we miss frame responses, and the handler isn’t removing messages from the queue. I did check for this, but it’s one scenario at least.
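To make the lock-up scenario concrete (a hedged sketch with invented names, not the binding's actual code): if the RX thread's notification can fire between the TX thread checking for a response and starting to wait, the TX thread sleeps forever and the queue backs up. Guarding the check and the wait with the same lock, and using a timed wait, avoids that:

```java
// Sketch of a TX thread waiting for the RX thread without losing the wake-up.
class TxRxSyncSketch {
    private final Object lock = new Object();
    private boolean responseReceived;

    // Called by the TX thread after sending a frame.
    void awaitResponse(long timeoutMs) throws InterruptedException {
        synchronized (lock) {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!responseReceived) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return; // timed out - the caller can abort the transaction
                }
                lock.wait(remaining); // releases the lock while waiting
            }
        }
    }

    // Called by the RX thread when a matching frame arrives.
    void onRxFrame() {
        synchronized (lock) {
            responseReceived = true;
            lock.notifyAll(); // can't be lost: both threads use the same lock
        }
    }
}
```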
Whatever it is, it makes a huge difference if logging is turned off:
I just had a quick look at your log, and there’s no real evolution of the queue at all - it starts at 4, and ends at 4. It does drop down to 3 at one point, so we could argue it’s gone up by 1 in this log, but it’s hard to really establish when that might have happened. It’s pretty normal for it to go up and down.
I’ll make another small change to logging, but it’s hard to see how to capture this if it’s basically working with logging on…
On the other hand, if it’s actually working (i.e. not locking up and becoming unresponsive) then maybe the main issue is resolved (maybe!!)
You're right about the log, it's probably negligible. You mentioned that the queue should never go above 3 permanently; that's why I thought it might matter.
It does become unresponsive within a couple of hours with the log level set to ERROR. But as you said, I have no idea how to catch this. I'll let it run at debug level for a few days and see what happens.
I thought it was just a coincidence, but I get the same sort of issue. The lights have been responsive for over a full day now and still going. If I set the logging to ERROR, the lights become unresponsive within the hour.
I would say that the issue with the growing queue count was not related, but I might be wrong. At some point the queue will fill to the point where it won't allow more entries, so maybe it is related. Why it's worse when debug is off, I don't know... However, I'm reasonably confident that I have fixed this, so I'll do an update later and we can see if that is the case ;) ...
Nice 👍 Just in case, with debug logging on this time:
Devices are still becoming unresponsive with 2.3.0.201801130942, log level set to ERROR. @BClark09 can you confirm?
Was just about to post! Almost within the hour, the devices stopped responding when logging was set to WARN.
Please provide a log - once it happens, turn on debug so I can at least see the state.
I'm going to increase the log level on some log entries - just certain exceptions or "minor" errors that shouldn't really happen, and also the log about the queue length. That will allow you to run with error logging but still log some information, so we can see if the queue is still increasing and whether there are any timeouts etc. Let's see if that shows anything. I'll do that soon, but in the meantime, if you have a log showing debug info after it's stopped responding, please provide it.
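Roughly what I mean, as a sketch with an invented threshold and names (not the binding's actual code): once the queue grows beyond what's considered normal, the length is reported at a level that's visible even when running at ERROR.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: promote the queue-length message above debug once it looks abnormal.
class QueueLengthLogSketch {
    private static final int QUEUE_WARN_THRESHOLD = 3;
    private final Queue<Object> txQueue = new ConcurrentLinkedQueue<>();

    void queueFrame(Object frame) {
        txQueue.add(frame);
        if (txQueue.size() > QUEUE_WARN_THRESHOLD) {
            // previously only logged at debug; raising the level lets users
            // running at ERROR still see the queue growing
            System.err.println("TX queue length now " + txQueue.size());
        }
    }
}
```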
Unfortunately openHAB restarted (nothing to do with this binding) before I turned on debug logging, so I will try to get this to happen again in a few hours when I return home. The following is its current state, but because I'm not at home I can't say if the toggle of
I’ll post another version soon - I’ve found a concurrency issue so this might (!!) explain the problem.
Already looking good. The TX queue is constantly at 1 and my lights are responding to polling updates. Will let you know how it goes when I return home! P.S. I found that I can tell my lights are changing because my Z-Wave devices report luminance changes in each room ;)
Looking very good here too. I’d like to let it run a bit longer, but it seems to be fixed now.
Ok, cool - thanks. What I’ll do then is tidy things up and push this into the library repo properly. Let’s give it a day or so, and if there are no further issues we might think about closing then ;)
Still responsive, queue still at 1.
All working great here too! Thanks a lot @cdjackson!!
Everything still working, I think we can close here. Thanks Chris!