Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry configuration #655

Closed
abedwardsw opened this issue Jan 9, 2019 · 22 comments
Closed

retry configuration #655

abedwardsw opened this issue Jan 9, 2019 · 22 comments

Comments

@abedwardsw
Copy link

App build number: 2.1.2 (21201)
Android version: 8.0.0
Device: Samsung S8+

Hi there, I am using Owntracks with openhab for presense detection. So i have it configured to point to myopenhab.org. On occasion when I glance at the notification on Android I will see that the status shows a HTTP 500 error. Upon seeing this I will go and hit the "report" button which will force a retry and then clear the error. I was wondering if it would be possible to have owntracks retry automatically a couple times at least in this case, as well as perhaps every x minutes thereafter during a failure scenario?

thanks.

@jpmens
Copy link
Member

jpmens commented Jan 9, 2019

A HTTP status code 500 is, as you likely know, a server failure (i.e. myopenhab in your case).

The question is how often should we retry? Once, thrice, seventeen times, ... ? We also have to be (at least a little) conservative with data volumes, and retrying on the 500 might even be construed as a DoS.

It's difficult.

What you might at least try in parallel is to talk to our friends at openHAB and see if they can say why the 500s are thrown.

@abedwardsw
Copy link
Author

hey @jpmens thanks for the reply. Yes I also am investigating the openhab side, but as i'm sure you are aware, no system is in fact perfect and there are sometimes errors, in this scenario a retry would be helpful. So a reasonable number of retries I think would be useful.

In my opinion sensible defaults which could be overriden would be something like the following. Given the relaxed values, I do not think they would be treated as DoS:

  • retry_sleep: 30 seconds
  • retry_max: 10 times
  • exponential_retry: 30 seconds (increase retry by this amount of seconds each time, unless reached max_retry)
  • max_retry_sleep: 120 seconds

@abedwardsw
Copy link
Author

hmm, thinking more, maybe a simpler/better implementation would be that a thread could wakeup every x minutes and check if the last request was errored/failed, and then retry it. might be much simpler to implement and better in the longrun.

@binarybucks
Copy link

Implementing that kind of retry logic generally doesn't work well on a mobile device. Things that can be done in the background are rather limited.

Additionally, the app is sending location data and not nuclear missile control commands.
If the backend doesn't accept messages for $REASON, its generally not worth the effort to do complicated retry handling in my opinion. A better idea would be to fix the faulty backend.

@abedwardsw
Copy link
Author

well, I guess I do see having reliable communication as important. For my use case I use owntracks as part of a home presence solution.

What I have noticed in the last couple days, I am still going to debug more but it seems to be the case.

For me, I live in a high rise condo, so when I come home I park in the underground. I have been monitoring and believe I notice that when I am underground (with no real cell signal) and owntracks tries to send the http request, I get the 500. As soon as i’m back in my condo and have cell/wifi, hit “report” and the 500 error goes away. I also tried going into the elevator where there is also no signal and hitting the “report” button and then the 500 error will return

Can you comment on the following please:

  • When a region transition notification is missed due to a http error, will that ever get resent or is it completely lost?
  • What is the expected behavior when the network is offline temporarily? As soon as network comes online will you detect this and try to send statuses/transitions? Are all region transition messages lost?

@binarybucks
Copy link

binarybucks commented Jan 10, 2019

When an error occurs (5xx,4xx) the message is discarded. This is also true for transition messages.

When there is no connection and the endpoint cannot be reached, delivery is retried until it can be delivered or a 4xx,5xx error is returned

Delivery is not retired on network changes. When a new message shall be published it’s added to the queue and the head is processed until it is empty

@abedwardsw
Copy link
Author

thank you @binarybucks , that helps with the understanding.

do you know why a 500 might be reported when there is no signal? it seems in this case the http error code that is returned is not what we are expecting. I would like to have that retry logic in this case so I don't lose the data. I guess it's somehow related to the fact that you still have some bars and a theoretical connection when in the elevator, but it just doesn't work I guess.

Do you have a recommendation on any improvements that could be made to not lose the messages in this case? As my daily routine involves going into the underground where there is no signal I assume this happens most days which now makes sense with the behavior I was seeing.

  • Perhaps a easier change might be to make the retrieable error codes configurable so that I could mark the 500 as a connection error instead of a throwaway?
  • or perhaps upon a 500 error you make a call to some other web service health check to see if you also get a offline/500 for that one, if so they you assume you are offline and do not discard? Perhaps adding a configurable health check url?

any ideas? are you willing to help me improve so I don't lose the messages in this use case? would really appreciate it as this is my daily routine ;-)

@abedwardsw
Copy link
Author

can you also consider re-opening the case, I can change the description to something like transition events are lost when cell signal is weak?

@jpmens
Copy link
Member

jpmens commented Jan 10, 2019

Transition events are surely not lost when cell signal is weak. The issue you're experiencing here is that your backend service is responding with a code 500. As @binarybucks clearly said above, a 4xx or 5xx code causes the app to discard the POST. When there's no signal, messages are queued, and even on a weak signal we can deliver if the endpoint behaves.

If you are able to reproduce this on an endpoint which doesn't fail you (i.e. something you control and/or for which you can produce logs) then we'll revisit.

@binarybucks
Copy link

You can also generate a debug log as shown in the booklet and send it to me

@abedwardsw
Copy link
Author

hi, thanks @binarybucks that is great. I will do that. I can easily reproduce now that I know the steps. I simply took the elevator to my underground and then hit the "report" button, as expected a 500 error comes up. I'll create a debug log later and have a look first and then send it to you.

I can find another endpoint out there to send to if needed, but given that it seems reproduceable so easily by just going into my underground parking and hitting report, I don't really think that is needed as to me this shows its not related to a faulty endpoint but network connection. @binarybucks please let me know if you need me to test with another endpoint as well.

@binarybucks binarybucks reopened this Jan 10, 2019
@growse
Copy link
Collaborator

growse commented Jan 10, 2019

I simply took the elevator to my underground and then hit the "report" button, as expected a 500 error comes up.

Can I just double-check my understanding - when you have no signal or connectivity, the app says there's a '500 error'? Or does your device report a signal / bars, but you know that traffic is unlikely due to the location?

@abedwardsw
Copy link
Author

I never get reception in my elevator and underground, the walls/concrete/metal block the signal. Typically there is a bar or two, then after say a minute or so I'll see android realize this and switch to say "Network Unavailable". I'll run some more tests later, but I believe when I ran my test earlier today even if I hit "Submit report" after the "Network Unavailable" message appeared on my android, the 500 status still showed in the OwnTracks notification panel. I am not sure if when I hit "Submit report" and the network is unavailable if Owntracks actually would update the message at that point or just wait for network to become online before actually trying again.

I am just guessing, but I guess this issue occurs in the minute or so when the network is not reachable, but android hasn't yet detected this and switched to "Network Unavailable" status. I guess somewhat related to issue-642 although not really.. I guess fixing that issue still wouldn't solve my problem. I guess we first need to understand why does a 500 get returned instead of a say a 408, 503 or something like that and what the expected behavior is from android and/or the client lib you use in this scenario.

Later today, i'll update my endpoint to something that captures/logs requests (will find some free service out there), and then enable debug logging in owntracks and then go into the elevator/parkade and try some requests. If you think of any other scenario I can do/try which will help gather more information, just let me know.

I do really appreciate you giving me a hand here. I really do love OwnTracks, and now with the addition of GPSTracker from openhab, things typically work very well for me. It just doesn't work well when I miss/lose the transition events.

@abedwardsw
Copy link
Author

just to put a quick update here. I spent some more time trying to reproduce the issue. Unfortunately now I can't seem to reproduce the issue, either with a dummy url (mockable.io) or with myopenhab. Now it's just saying "Error" when offline instead of the "Error 500" which I assume is the expected behavior. i'm going to stop for today otherwise people might wonder why I spend my evening travelling up and down the elevators ;-) Thank you all very much for your help so far. I guess at this point i'll just leave debugging enabled on owntracks for a while and see if I can pinpoint how to reproduce and have a look at the logs myself first if I can reproduce. If you want you can close this until I am able to gather more information.

I did notice that it does take some time for the queue to be cleared once network comes back online. assuming issue-642 might help in this case, but just an observation and not part of this ticket.

@binarybucks
Copy link

binarybucks commented Jan 11, 2019

I've been doing some tests for #656. I've noticed that messages are also queued if if the backend returns a 4xx,5xx status code. This is acutally an error in the code but makes your behavior even more strange.

I did notice that it does take some time for the queue to be cleared once network comes back online. assuming issue-642 might help in this case, but just an observation and not part of this ticket.

Yes, the queue is currently processed when a new message should be send.
It basically works by adding the new message to the end of the queue and then send everything in the queue until it is empty or a delivery fails.

@binarybucks
Copy link

@abedwardsw were you able to gather the debug log so I can have a look?

@abedwardsw
Copy link
Author

abedwardsw commented Jan 17, 2019

hi @binarybucks , I am attaching my logs for the last week. Unfortunately I haven't seen the 503 issue with missing transitions since my initial reporting. I have looked through the logs myself first. I do see some exceptions related to DNS resolution HTTP code 502 (for example see timestamp Tue Jan 15 2019 at 12:31:07:727 AM). I did notice these on occasion in the notification bar for owntracks where it would say HTTP 502.

I checked for missing transition events, they appear to have gone through fine, so I think everything's working well. Not sure if the dns resolution issue should be handled/catched or

thank you very much for helping, I don't want to waste any more of your time. If you want, We can close this issue and then re-open once (if) I can reproduce the missing transition issue with debug logging on.

If you think there's something useful in my logs let me know, thanks.

@binarybucks
Copy link

Thank you. I’ll have a look. I’ve removed the link to the files because they contained reverse geocoding data with readable locations.

@abedwardsw
Copy link
Author

hi @binarybucks i've been monitoring for a few more days. at this point I believe I was incorrect about the missing transition events. What I believe is really happening is captured in the logs I sent. It's the HTTP 502 (java.net.UnknownHostException) which is the dns resolution error that I see, most likely due to temporary internet loss on my android device while going in the elevator/underground.

the only other oddity I see (if it matters), is that sometimes in the logs when we get the "java.net.UnknownHostException" we see "message:null," log entry right before whereas other times we see "message:HTTP code 502". Not sure if this matters in your error handling and if either are handled differently.

The events are NOT lost, and they do go through on the next sync, it's just that sometimes it takes a while before owntracks initiates the next sync. Since i'm using this for home automation it would be ideal to get these events sooner after a transmission error. Therefore it it my opinion/thought that if you implemented issue-642 and captured the network coming back online event and then immediately at that time flush the queue it would improve my situation to be really great ;-)

if you see anything different please let me know, but I believe this is the case. As mentioned in last comment, i'm not sure, maybe it makes sense to capture/trap the DNS resolution error and assume that you are offline at that time. Maybe they already are, not sure as i'm not familiar with the code.
although I guess would have to differentiate from an actual dns resolution issue vs one that is due to the network being offline.

thank you for taking the time to read my comment.

@binarybucks
Copy link

Thank you for doing the troubleshooting and all the testing. I’ve also been playing around a bit and abused the elevator at work quite a bit (yes, there were questions what I’m doing with a phone and a laptop in there).

Apparently the error handling of our HTTP library is a bit strange. Calls to hosts that fail due to a missing network connection are wrongly returned with a 5xx HTTP code. They’re requeued but that explains all the 5xx errors.

I will see if I can add a faster reconnect job based on the WorkManager to retry queue processing when a connection is available again.

Things are a bit busy at the moment so please excuse if that might take a bit.

@abedwardsw
Copy link
Author

no worries at all, i'm just glad to see were on the same page, thank you very much for the great software.

@binarybucks
Copy link

Handled in #656

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants