New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
retry configuration #655
Comments
A HTTP status code 500 is, as you likely know, a server failure (i.e. myopenhab in your case). The question is how often should we retry? Once, thrice, seventeen times, ... ? We also have to be (at least a little) conservative with data volumes, and retrying on the 500 might even be construed as a DoS. It's difficult. What you might at least try in parallel is to talk to our friends at openHAB and see if they can say why the 500s are thrown. |
hey @jpmens thanks for the reply. Yes I also am investigating the openhab side, but as i'm sure you are aware, no system is in fact perfect and there are sometimes errors, in this scenario a retry would be helpful. So a reasonable number of retries I think would be useful. In my opinion sensible defaults which could be overriden would be something like the following. Given the relaxed values, I do not think they would be treated as DoS:
|
hmm, thinking more, maybe a simpler/better implementation would be that a thread could wakeup every x minutes and check if the last request was errored/failed, and then retry it. might be much simpler to implement and better in the longrun. |
Implementing that kind of retry logic generally doesn't work well on a mobile device. Things that can be done in the background are rather limited. Additionally, the app is sending location data and not nuclear missile control commands. |
well, I guess I do see having reliable communication as important. For my use case I use owntracks as part of a home presence solution. What I have noticed in the last couple days, I am still going to debug more but it seems to be the case. For me, I live in a high rise condo, so when I come home I park in the underground. I have been monitoring and believe I notice that when I am underground (with no real cell signal) and owntracks tries to send the http request, I get the 500. As soon as i’m back in my condo and have cell/wifi, hit “report” and the 500 error goes away. I also tried going into the elevator where there is also no signal and hitting the “report” button and then the 500 error will return Can you comment on the following please:
|
When an error occurs (5xx,4xx) the message is discarded. This is also true for transition messages. When there is no connection and the endpoint cannot be reached, delivery is retried until it can be delivered or a 4xx,5xx error is returned Delivery is not retired on network changes. When a new message shall be published it’s added to the queue and the head is processed until it is empty |
thank you @binarybucks , that helps with the understanding. do you know why a 500 might be reported when there is no signal? it seems in this case the http error code that is returned is not what we are expecting. I would like to have that retry logic in this case so I don't lose the data. I guess it's somehow related to the fact that you still have some bars and a theoretical connection when in the elevator, but it just doesn't work I guess. Do you have a recommendation on any improvements that could be made to not lose the messages in this case? As my daily routine involves going into the underground where there is no signal I assume this happens most days which now makes sense with the behavior I was seeing.
any ideas? are you willing to help me improve so I don't lose the messages in this use case? would really appreciate it as this is my daily routine ;-) |
can you also consider re-opening the case, I can change the description to something like transition events are lost when cell signal is weak? |
Transition events are surely not lost when cell signal is weak. The issue you're experiencing here is that your backend service is responding with a code 500. As @binarybucks clearly said above, a 4xx or 5xx code causes the app to discard the POST. When there's no signal, messages are queued, and even on a weak signal we can deliver if the endpoint behaves. If you are able to reproduce this on an endpoint which doesn't fail you (i.e. something you control and/or for which you can produce logs) then we'll revisit. |
You can also generate a debug log as shown in the booklet and send it to me |
hi, thanks @binarybucks that is great. I will do that. I can easily reproduce now that I know the steps. I simply took the elevator to my underground and then hit the "report" button, as expected a 500 error comes up. I'll create a debug log later and have a look first and then send it to you. I can find another endpoint out there to send to if needed, but given that it seems reproduceable so easily by just going into my underground parking and hitting report, I don't really think that is needed as to me this shows its not related to a faulty endpoint but network connection. @binarybucks please let me know if you need me to test with another endpoint as well. |
Can I just double-check my understanding - when you have no signal or connectivity, the app says there's a '500 error'? Or does your device report a signal / bars, but you know that traffic is unlikely due to the location? |
I never get reception in my elevator and underground, the walls/concrete/metal block the signal. Typically there is a bar or two, then after say a minute or so I'll see android realize this and switch to say "Network Unavailable". I'll run some more tests later, but I believe when I ran my test earlier today even if I hit "Submit report" after the "Network Unavailable" message appeared on my android, the 500 status still showed in the OwnTracks notification panel. I am not sure if when I hit "Submit report" and the network is unavailable if Owntracks actually would update the message at that point or just wait for network to become online before actually trying again. I am just guessing, but I guess this issue occurs in the minute or so when the network is not reachable, but android hasn't yet detected this and switched to "Network Unavailable" status. I guess somewhat related to issue-642 although not really.. I guess fixing that issue still wouldn't solve my problem. I guess we first need to understand why does a 500 get returned instead of a say a 408, 503 or something like that and what the expected behavior is from android and/or the client lib you use in this scenario. Later today, i'll update my endpoint to something that captures/logs requests (will find some free service out there), and then enable debug logging in owntracks and then go into the elevator/parkade and try some requests. If you think of any other scenario I can do/try which will help gather more information, just let me know. I do really appreciate you giving me a hand here. I really do love OwnTracks, and now with the addition of GPSTracker from openhab, things typically work very well for me. It just doesn't work well when I miss/lose the transition events. |
just to put a quick update here. I spent some more time trying to reproduce the issue. Unfortunately now I can't seem to reproduce the issue, either with a dummy url (mockable.io) or with myopenhab. Now it's just saying "Error" when offline instead of the "Error 500" which I assume is the expected behavior. i'm going to stop for today otherwise people might wonder why I spend my evening travelling up and down the elevators ;-) Thank you all very much for your help so far. I guess at this point i'll just leave debugging enabled on owntracks for a while and see if I can pinpoint how to reproduce and have a look at the logs myself first if I can reproduce. If you want you can close this until I am able to gather more information. I did notice that it does take some time for the queue to be cleared once network comes back online. assuming issue-642 might help in this case, but just an observation and not part of this ticket. |
I've been doing some tests for #656. I've noticed that messages are also queued if if the backend returns a 4xx,5xx status code. This is acutally an error in the code but makes your behavior even more strange.
Yes, the queue is currently processed when a new message should be send. |
@abedwardsw were you able to gather the debug log so I can have a look? |
hi @binarybucks , I am attaching my logs for the last week. Unfortunately I haven't seen the 503 issue with missing transitions since my initial reporting. I have looked through the logs myself first. I do see some exceptions related to DNS resolution HTTP code 502 (for example see timestamp Tue Jan 15 2019 at 12:31:07:727 AM). I did notice these on occasion in the notification bar for owntracks where it would say HTTP 502. I checked for missing transition events, they appear to have gone through fine, so I think everything's working well. Not sure if the dns resolution issue should be handled/catched or thank you very much for helping, I don't want to waste any more of your time. If you want, We can close this issue and then re-open once (if) I can reproduce the missing transition issue with debug logging on. If you think there's something useful in my logs let me know, thanks. |
Thank you. I’ll have a look. I’ve removed the link to the files because they contained reverse geocoding data with readable locations. |
hi @binarybucks i've been monitoring for a few more days. at this point I believe I was incorrect about the missing transition events. What I believe is really happening is captured in the logs I sent. It's the HTTP 502 (java.net.UnknownHostException) which is the dns resolution error that I see, most likely due to temporary internet loss on my android device while going in the elevator/underground. the only other oddity I see (if it matters), is that sometimes in the logs when we get the "java.net.UnknownHostException" we see "message:null," log entry right before whereas other times we see "message:HTTP code 502". Not sure if this matters in your error handling and if either are handled differently. The events are NOT lost, and they do go through on the next sync, it's just that sometimes it takes a while before owntracks initiates the next sync. Since i'm using this for home automation it would be ideal to get these events sooner after a transmission error. Therefore it it my opinion/thought that if you implemented issue-642 and captured the network coming back online event and then immediately at that time flush the queue it would improve my situation to be really great ;-) if you see anything different please let me know, but I believe this is the case. As mentioned in last comment, i'm not sure, maybe it makes sense to capture/trap the DNS resolution error and assume that you are offline at that time. Maybe they already are, not sure as i'm not familiar with the code. thank you for taking the time to read my comment. |
Thank you for doing the troubleshooting and all the testing. I’ve also been playing around a bit and abused the elevator at work quite a bit (yes, there were questions what I’m doing with a phone and a laptop in there). Apparently the error handling of our HTTP library is a bit strange. Calls to hosts that fail due to a missing network connection are wrongly returned with a 5xx HTTP code. They’re requeued but that explains all the 5xx errors. I will see if I can add a faster reconnect job based on the WorkManager to retry queue processing when a connection is available again. Things are a bit busy at the moment so please excuse if that might take a bit. |
no worries at all, i'm just glad to see were on the same page, thank you very much for the great software. |
Handled in #656 |
App build number: 2.1.2 (21201)
Android version: 8.0.0
Device: Samsung S8+
Hi there, I am using Owntracks with openhab for presense detection. So i have it configured to point to myopenhab.org. On occasion when I glance at the notification on Android I will see that the status shows a HTTP 500 error. Upon seeing this I will go and hit the "report" button which will force a retry and then clear the error. I was wondering if it would be possible to have owntracks retry automatically a couple times at least in this case, as well as perhaps every x minutes thereafter during a failure scenario?
thanks.
The text was updated successfully, but these errors were encountered: