Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP outgoing queue gets stuck after connection error #656

Closed
MikeTheTux opened this issue Jan 11, 2019 · 48 comments
Closed

HTTP outgoing queue gets stuck after connection error #656

MikeTheTux opened this issue Jan 11, 2019 · 48 comments
Assignees
Labels

Comments

@MikeTheTux
Copy link

MikeTheTux commented Jan 11, 2019

Hi,

I’m using Regions in OwnTracks for GeoFencing in OpenHAB via a HTTPs-Server. OwnTracks is switched to “Manuel Mode”, means Region Updates only (Enter/Exit).

Usually I’m leaving my Home-Region via car. The car takes the SIM of the mobile via Bluetooth (rSAP) and - after setting up the internet connection - provides a data connection back to the mobile via WiFi. During this transition, there is a short phase (~1 Minute), where the WiFi is already available and connected, but without internet access! Exactly during this time OwnTracks wants to publish the Exit Event. The DNS name of my Server cannot be resolved. An error shows up in the OwnTracks status and the event is lost - OwnTracks doesn’t retry it.

Is it possible to implement a kind of retry mechanism, especially in the case that GeoFencing (Manual Mode with Regions) is used? In this case a lost message is much more critical that in the normal tracking mode.
Do you know any workaround except from doing the Region calculation in OpenHAB instead of OwnTracks?

Thanks and kind regards
michael

  • App build number: 2.1.2 (21201)
  • Android version: 8.0.0
  • Device: Samsung Galaxy S8 edge
@binarybucks
Copy link

We're investigating something similar in the other issue but so far I've been unable to reproduce it.
Can you check if the message is really dropped in your case and not just delayed a long time until the next message is send? You can easily see the number of queued messages by going to the status page
grafik

@MikeTheTux
Copy link
Author

You are right. The event is not lost. It remains in the queue ✔
Unfortunately OwnTracks does not retry the transmission, as soon as the WiFi receives the internet connection (WLAN Symbol changes from ❗ to OK). A retry is done when it changes back from WLAN to Cellular.
This means that the Exit Event is delayed for the duration of the car drive 😒

thanks, michael

@MikeTheTux
Copy link
Author

today it did not recover from the DNS error at all. the problem remained, even after opening and closing the host settings, until I changed the mode to Significant location change / Move mode.

@binarybucks
Copy link

Hard to say what happend without further information. Could you enable debug logging and see if the issue occurs again?

@MikeTheTux
Copy link
Author

Logging Enabled. I'll come back with a logfile

@MikeTheTux
Copy link
Author

What if I say that the issue didn't occur since I enabled logging? Without logging it was reproduceable.

@binarybucks
Copy link

binarybucks commented Jan 17, 2019

I wouldn't believe you if you said that ;)

@jpmens
Copy link
Member

jpmens commented Jan 17, 2019

Great stuff: solution: leave logging enabled. Shall I close this ticket now? (Just kidding :-)

@sfjes
Copy link

sfjes commented Jan 22, 2019

I've encountered a couple times now what seems like a similar situation. I'll find that my location is not being reported (via HTTPS) and in looking at the status window, I see:

Endpoint state: Error
Endpoint state message: java.net.UnknownHostException: Unable to resolve host "my.server.here": No address associated with hostname
Endpoint queue: 272

Both times I've presumably gone through an area without connectivity, the problem is that it seems stuck in that same state after connectivity is restored, instead of the usual return to normal + uploading the queue.

Despite being able to lookup/reach the host with a browser, Owntracks continues to increment the Endpoint queue # while showing the same error. I've had this persist for days with thousand of endpoints.

Restarting Owntracks clears the error and it starts reporting again right away - unfortunately all the queued messages seem to be thrown away on restart. I turned on debug logging but haven't been able to artificially reproduce the error so far (by toggling airplane mode etc).

App build number: 2.1.2 (21201)
Android version: 7.0
Device: LG V20

@binarybucks
Copy link

I've noticed the issue several times in the past but I've been unable to reproduce it reliably. At some point it seems that it can no longer resolve adresses and hence fails to send new messages.
It might be caused by Android Doze mode that prevents background network access for certain apps after longer times of inactivity. Could you check if the app is whitelisted?

@sfjes
Copy link

sfjes commented Jan 22, 2019

The app is whitelisted for protection against battery optimizations (battery optimization whitelisted: true), however it's not whitelisted for "unrestricted mobile data access", which allows background data usage when Data Saver mode is enabled and using cellular connectivity.

In these two cases the Data Saver feature was indeed enabled at some points. As mentioned it continues to affect the app even when on wifi, which prevents the expected uploading of the queue. I wasn't able to reproduce it even with Data Saver enabled, but I also didn't wait for extended periods so perhaps Android hadn't fully restricted it during my tests.

Pending a fix or workaround (other than "don't use Data Saver", if that's indeed the cause), it'd be nice to be able to preserve the queue when restarting the app, so that no data is lost, just delayed. Is there a way to do that?

@binarybucks
Copy link

not at the moment

@sfjes
Copy link

sfjes commented Jan 23, 2019

I found today that the issue still occurs even if OwnTracks is whitelisted to allow background mobile data usage while in Data Saver mode (with Data Saver enabled).

I restarted the app, confirmed it was able to post updates, then disconnected from wifi & traveled a bit (in Move mode). The first java.net. UnknownHostException happens within about a minute and the updates begin to queue, with repeated occurrences of the same exception. ~21 minutes later there's 104 messages in the queue. Back on the wifi, not that it should matter since the app was whitelisted for background data access, but still getting unable to resolve host, have to restart the app to get updates working again.

Tue Jan 22 2019 at 03:21:04:884 PM : error:IOException. Delivery failed java.net.UnknownHostException: Unable to resolve host "example.com": No address associated with hostname at java.net.Inet6AddressImpl.lookupHostByName(Inet6AddressImpl.java:130

I notice the next exception error says the endpoint state lists HTTP code 400, which doesn't really jive with being unable to resolve the host:
Tue Jan 22 2019 at 03:21:22:794 PM : endpoint state changed Error. Message: HTTP code 400 java.net.UnknownHostException: Unable to resolve host "example.com": No address associated with hostname at java.net.Inet6AddressImpl.lookupHostByName(Inet6AddressImpl.java:130)

Here's the full last set of messages from the app before restarting to fix it:

  Tue Jan 22 2019 at 03:48:03:344 PM :    getting first message from queue: 1548199259727

  Tue Jan 22 2019 at 03:48:03:365 PM :    url:https://example.com/api/publish, userInfo:null, messageId:1548199259727

  Tue Jan 22 2019 at 03:48:03:506 PM :    request was not successful. HTTP code 400

  Tue Jan 22 2019 at 03:48:03:511 PM :    message:HTTP code 400,

  Tue Jan 22 2019 at 03:48:03:522 PM :    endpoint state changed Error. Message: HTTP code 400 java.net.UnknownHostException: Unable to resolve host "example.com": No address associated with hostname at java.net.Inet6AddressImpl.lookupHostByName(Inet6AddressImpl.java:130) at java.net.Inet6AddressImpl.lookupAllHostAddr(Inet6AddressImpl.java:74) at java.net.InetAddress.getAllByName(InetAddress.java:752) at okhttp3.Dns$1.lookup(Dns.java:40) at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:185) at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:149) at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:84) at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:214) at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135) at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.java:156) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254) at okhttp3.RealCall.execute(RealCall.java:92) at org.owntracks.android.services.MessageProcessorEndpointHttp.sendMessage(MessageProcessorEndpointHttp.java:197) at org.owntracks.android.services.MessageProcessorEndpointHttp.processOutgoingMessage(MessageProcessorEndpointHttp.java:255) at org.owntracks.android.messages.MessageLocation.processOutgoingMessage(MessageLocation.java:176) at org.owntracks.android.messages.MessageBase.run(MessageBase.java:111) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1133) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:607) at java.lang.Thread.run(Thread.java:761) Caused by: android.system.GaiException: android_getaddrinfo failed: EAI_NODATA (No address associated with hostname) at libcore.io.Posix.android_getaddrinfo(Native Method) at libcore.io.ForwardingOs.android_getaddrinfo(ForwardingOs.java:55) at java.net.Inet6AddressImpl.lookupHostByName(Inet6AddressImpl.java:111) ... 32 more

@binarybucks
Copy link

The error code 400 is indeed strange but so is the 500 that sometimes appears.
Is the DNS error only happening when data saver is on or also when it is disabled system wide?

@sfjes
Copy link

sfjes commented Jan 30, 2019

Since turning off Data Saver mode, I've seen it get stuck on again Unable to resolve host "...": No address associated with hostname. It also got stuck yesterday on the below error with ~3000 updates queued up. As before, doing a manual request or toggling airplane mode doesn't get it unstuck. So in this case it seems more like a general inability to recover from network errors in some cases, as opposed to specific to Data Saver or the hostname resolution failure.

Mon Jan 28 2019 at 03:52:12:516 PM :error:IOException. Delivery failed java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)

(3000+ backlogged messages makes for very large debug log files very fast 😄 )

@binarybucks
Copy link

Ah yes, I'll remove logging of queue content again, now that we ruled out that queue processing is the issue ;)

I've disabled caching for the HTTP client explicitly in the next build. Maybe it caches the error :s

@binarybucks
Copy link

Just a quick question, but do you get an IPv6 address from your mobile provider or is there only a IPv4/IPv6 record for your target address?

Maybe the route selector defaults to IPv6 but there's only IPv4 connection.

@sfjes
Copy link

sfjes commented Jan 30, 2019

My HTTP target address has only an IPv4 DNS record. I believe my mobile provider does provide IPv6 connectivity by default (T-Mobile), with IPv4 NAT. On wifi I have only IPv4 connectivity. OwnTracks works on both networks normally (and always after being restarted).

@binarybucks
Copy link

binarybucks commented Jan 30, 2019

Can you try to reset the HTTP client when it gets stuck the next time?
Just change the mode to MQTT mode and without further configuration, change it back to HTTP after a short time. This removes all queued messages but recreates the client.

Afterwards, send a message and see if it still is stuck.

@sfjes
Copy link

sfjes commented Jan 30, 2019

After performing the above steps, it's no longer stuck. Response 201, 1 As you said, the queue was chucked.

@binarybucks
Copy link

I think I slowly understand what's happening. The HTTP library maintains a connection pool. When there's a DNS or connection error the response is not closed correctly and shortly after all worker get stuck. Further queue processing attemps reuse the stuck HTTP worker threads.
I'll see if I can fix this

@binarybucks
Copy link

Could you install the debug APK from https://github.com/owntracks/android/releases/tag/v.2.1.3 to see if it improves anything? You can install it side by side because it uses the app identifier org.owntracks.android.debug.

I'm unfortunately unable to reproduce the behavior on my device or the emulator. When I enable airplane mode again, everything resumes fine. I might already have fixed the issue accidentially by rewriting parts of the HTTP code before though.

@sfjes
Copy link

sfjes commented Jan 30, 2019

I can give it a go, but it has a conflict trying to install side-by-side:

01-30 05:26:21.182 1652 1848 W PackageManager: Package couldn't be installed in /data/app/org.owntracks.android.debug-1 01-30 05:26:21.182 1652 1848 W PackageManager: com.android.server.pm.PackageManagerException: Can't install because provider name org.owntracks.android.workmanager-init (in package org.owntracks.android.debug) is already used by org.owntracks.android

I can install it in place of the existing one, unless it'd be preferable to resolve that so that their behavior under the same conditions can be compared.

@binarybucks
Copy link

Ah that one's picky. I've updated the release with a new hardcoded application id.

@sfjes
Copy link

sfjes commented Jan 30, 2019

01-30 05:55:52.499 1652 1848 W PackageManager: Package couldn't be installed in /data/app/org.owntracks.android.debug.debug-1
01-30 05:55:52.499 1652 1848 W PackageManager: com.android.server.pm.PackageManagerException: Can't install because provider name org.owntracks.android.workmanager-init (in package org.owntracks.android.debug.debug) is already used by org.owntracks.android

More debugs!

@binarybucks
Copy link

Next try https://alr.st/files/app-debug.apk

@sfjes
Copy link

sfjes commented Jan 31, 2019

01-31 00:42:34.852 1652 1848 W PackageManager: Package couldn't be installed in /data/app/org.owntracks.android.debug-1
01-31 00:42:34.852 1652 1848 W PackageManager: com.android.server.pm.PackageManagerException: Can't install because provider name org.owntracks.android.fileprovider (in package org.owntracks.android.debug) is already used by org.owntracks.android

I can just install the debug version alone, if having a side-by-side comparison (both at the same time) isn't a priority. Otherwise send in the next try!

@binarybucks
Copy link

I'm out of ideas unfortunately. If possible, just install the debug version alone

@sfjes
Copy link

sfjes commented Jan 31, 2019

No problem, debug version installed, will followup in a few days (hopefully). Thanks!

@binarybucks binarybucks changed the title Region Exit Events lost due to parallel transition of the mobile from Cellular to WiFi HTTP outgoing queue gets stuck after connection error Feb 1, 2019
@binarybucks binarybucks self-assigned this Feb 1, 2019
@binarybucks binarybucks added the bug label Feb 1, 2019
@sfjes
Copy link

sfjes commented Feb 2, 2019

So while out and about using the debug version, I found the app stuck with 148 queued messages and the Endpoint state message javax.net.ssl.SSLHandshakeException: SSL handshake aborted: ssl=0x750737e180: I/O error during system call, Connection reset by peer, while other apps worked fine. (No Data Saver in use.)

Once back on the home wifi, now with 241 messages queued, the Endpoint state message had actually changed to java.net.UnknownHostException: Unable to resolve host "...": No address associated with hostname

@binarybucks
Copy link

Reading through the HTTP library issue tracker on Github, I found several other reports with the same issue on Samsung devices: square/okhttp#3919. Unfortunately, there is no workaround mentioned.

I could add a custom setting that allows to circumvent system DNS settings but I'm not sure if that is wise.

Meanwhile I've created a new version that sets up a new HTTP client instance on each request. Maybe the new instances use a fresh DNS instance. Could you check if the fresh version at https://alr.st/files/app-debug.apk improves anything?

@jpmens see, it's always a DNS issue 😁

@sfjes
Copy link

sfjes commented Feb 3, 2019

I've installed the latest debug.apk and will give it a go. There was the one stuck error on the previous build which seemed to not be a DNS issue -- the SSL handshake IO Error / Connection reset by peer one.

@sfjes
Copy link

sfjes commented Feb 7, 2019

Well, I haven't gotten stuck on the "Unable to resolve host" on the latest build, but I have gotten stuck twice with java.net.SocketTimeoutException: connect timed out, unfortunately. Progress perhaps?

Also again: javax.net.ssl.SSLHandshakeException: SSL handshake aborted: ssl=0x7284c7ae00: I/O error during system call, Connection reset by peer

While I have trouble reproducing it by toggling airplane mode, turning on/off wifi, data, etc, I'd say it happens about 90% of the time when actually traveling from one place to another (if there are network type changes involved).

@sfjes
Copy link

sfjes commented Feb 11, 2019

Went on a trip and managed to go a couple days, switching occasionally back and forth between native cell data and a cellular hotspot's wifi, without getting hung up. Checked not long after arriving home and saw it stuck on java.net.NoRouteToHostException: No route to host, which makes me wonder if the trip was just good luck or if there's something particular about being near home.

Maybe the existence of a region, or related to the hostname I use having split-dns (resolves to a different IPv4 address on home wifi vs cellular - same SSL cert). I've deleted the home region for now as an experiment.

@sfjes
Copy link

sfjes commented Feb 26, 2019

For better or worse, I haven't had the problem recur in any form in the last two weeks, despite a number and in-and-outs from home. I haven't changed anything since loading the last debug build.

@m-art-in
Copy link

Version 2.1.2:

I also have
java.net.UnknowHostException: Unable to resolve host
my queue hast 96 entries and can't find my server. Any help?

@jpmens
Copy link
Member

jpmens commented Aug 29, 2019

The message Unable to resolve host looks pretty self-explanatory to me.

@ronsmits
Copy link

I am having the same issues, after not being able to connect to the receiving host with http, the queue seems to be stuck. I was reading the code and noticed that there is a switch in getHttpClient() to either re-use the or setup a new client for each request. I could not however find this setting in the client.

Maybe this happens because the httpclient itself gets stuck? If I could find a way to switch it to always create a new httpClient I could test it.

@m-art-in
Copy link

Thank you for this details. Would be nice this could be fixed.

@giejay
Copy link

giejay commented Oct 15, 2019

I can quite easily reproduce the unknown host exception followed by a pending queue by just disabling my data and WiFi connection and trying to push my location. After enabling data again, I would expect the queue to be emptied within a few minutes. Unfortunately this is not the case, it mostly takes around 20 minutes, which is too long for my unlock the door when I'm home scenario ;) I will try to implement some sort of retry mechanism to be not dependant on some (location) event to arrive for clearing the queue.

@binarybucks
Copy link

Try setting dontReuseHttpClient to true in the config editor.

@giejay
Copy link

giejay commented Oct 20, 2019

I am using that setting now for a few days, did not see the error yet, will monitor it the coming weeks. Just for my information, why would the queue be emptied if there are no new events? When I look at the code, it's all reactive, there's no process which runs every X minutes so the queue will remain as is without events.

@giejay
Copy link

giejay commented Oct 23, 2019

I was using the Move Monitoring Mode for the last few days, but because of a battery drain I changed it back to Significant Monitoring Mode. Today I was in an error state again, with a SocketTimeoutexception. It took about 35 minutes for it be retried.

Would it be possible to change the Monitoring Mode temporarily to Significant, whenever there are http requests stuck? And change it back afterwards?

@binarybucks
Copy link

Changing modes does not "unstick" the queue, hence no.
Adding a new report to the queue retries sending it starting from the queue head, hence changing to monitoring mode just gets things going again or messages are send way more often so you don't notice the problem.

If in significant location mode a location report fails, it takes until the next report to send it. This might be well over 35 minutes depending on location availability etc.

As background tasks are limited to run in certain intervals, we cannot continuously monitor for stuck messages. Your best bet is configuring ping with the lowest interval of 15m I think.

Also your assumption that the queue should be emptied after enabling data connection again is not correct. We're not monitoring any data connection hence it might again take some time until the next publish (unless there's a bug with Socket DNS resolution like there is on many Samsung devices, in which case dontReuseHttpClient might help).

@giejay
Copy link

giejay commented Oct 23, 2019

Changing modes does not "unstick" the queue, hence no.

Yes, I get that.

Adding a new report to the queue retries sending it starting from the queue head, hence changing to monitoring mode just gets things going again or messages are send way more often so you don't notice the problem.

That's exactly my point, the ping interval of 15m won't solve my door unlock issue, but if monitoring mode sends messages every 1 minute, this most likely will.

@binarybucks
Copy link

15m is the lowest interval for OS background tasks.
You could shoehorn significant mode to behave similar to move mode by setting low locatorInterval (desired interval >30s), locatorDisplacement=0 (no displacement) and locatorPriority=100 (GPS)

Ultimately you might however be using the wrong tool for the job you're trying to achieve.
The location API is resource constraint by power usage and hence there's always a trade of between reaction time and battery life.

You might have better results by using Region detection if Wifi Location is reliable in your area or some other kind of presence detection besides Owntracks.

@giejay
Copy link

giejay commented Oct 26, 2019

Thanks I'm gonna try that!

WiFi is not an option for me. Unlocking the doors with Bluetooth can sometimes taken up to a minute, if I need to be connected to my home network first before it starts unlocking doors, I'm probably gonna have to wait a lot ;)

Edit: changing locatorPriority doesn't do anything for me, its stuck at 2

@growse
Copy link
Collaborator

growse commented Nov 8, 2020

There's been a number of changes since 2.1.3 and 2.2.0, including the fact that even bringing the app to the foreground will init a high-frequency location request (similar to how gmaps works). I don't believe there's a bug on the HTTP and queue side in 2.2.0, so will close.

If you have a specific issue that's still present in the latest version, please re-open and we'll take a look.

@growse growse closed this as completed Nov 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants