DNS resolution failure on Android after connectivity changes #4028

Closed
userar opened this issue Jan 31, 2018 · 8 comments

@userar

userar commented Jan 31, 2018

What version of gRPC are you using?

1.9.0

What did you expect to see?

For the gRPC channel to handle Android connectivity changes (e.g. from wifi to mobile data, or from no data connection to wifi).
For a resetConnectBackoff() call on the channel, triggered by a connectivity change, to short-circuit the backoff timer and reconnect immediately.

What did you do

Built a gRPC channel using OkHttpChannelBuilder. Registered an Android BroadcastReceiver for connectivity changes that calls the channel's resetConnectBackoff() (as recommended in #4011).

What did you see instead

resetConnectBackoff() was called from the broadcast receiver event (for android.net.conn.CONNECTIVITY_CHANGE) but failed to short-circuit the backoff timer. I had to wait approximately 60 seconds before the channel became usable again; until then it reported a host name resolution failure.
A sleep of a few seconds between the connectivity change and the resetConnectBackoff() call seems to fix the issue.
Is there any way to decrease the default backoff time? It may be a useful feature in situations like this.

ericgribkoff self-assigned this Jan 31, 2018
@ericgribkoff
Contributor

Thanks for reporting this. I'm having trouble reproducing the issue, and if you could answer a few more questions I will be better able to help diagnose the problem.

Does this issue happen reliably or is it intermittent? Is it specific to transitioning from no connection to wifi or from no connection to cellular? The backoff timer shouldn't impact switching from wifi to mobile or vice versa; let me know if this appears not to be the case.

Are you encountering this issue on an emulator or physical device? Which Android API level(s)?

That a few seconds sleep between the connectivity change and invoking resetConnectBackoff() seems to help is a bit strange. Are you sure that NetworkInfo#isConnected returns true before you invoke resetConnectBackoff()? Can you share your BroadcastReceiver implementation or other code that reproduces the issue?

@userar
Author

userar commented Feb 1, 2018

Thanks for the quick reply. This issue happens reliably.
I have created the table below to break down the behaviour across transitions (I tested two physical devices: one with wifi and mobile data connections, and one with a wifi-only connection).
I have tested both with and without a delay before calling resetConnectBackoff().
The network info state is CONNECTED when expected to be.

| Connectivity changes made* | Delayed or immediate reset** | Behaviour | Device and API level |
| --- | --- | --- | --- |
| wifi -> mobile data -> wifi | Immediate | Works -> Unable to resolve host indefinitely -> Works | Google Pixel XL - API 27 |
| wifi -> mobile data -> wifi | Delayed | Works -> Unable to resolve host indefinitely -> Unable to resolve host until delayed reset called | Google Pixel XL - API 27 |
| mobile data -> wifi -> mobile data | Immediate | Works -> Works -> Unable to resolve host indefinitely | Google Pixel XL - API 27 |
| mobile data -> wifi -> mobile data | Delayed | Works -> Works -> Unable to resolve host indefinitely | Google Pixel XL - API 27 |
| wifi -> off -> wifi | Immediate | Works -> Unable to resolve (expected) -> Unable to resolve host for approx 1 minute | Google Pixel XL - API 27 |
| wifi -> off -> wifi | Delayed | Works -> Unable to resolve (expected) -> Unable to resolve host until delayed reset called | Google Pixel XL - API 27 |
| mobile data -> off -> mobile data | Immediate | Works -> Unable to resolve (expected) -> Works | Google Pixel XL - API 27 |
| mobile data -> off -> mobile data | Delayed | Works -> Unable to resolve (expected) -> Unable to resolve host until delayed reset called | Google Pixel XL - API 27 |
| wifi -> off -> wifi | Immediate | Works -> Unable to resolve (expected) -> Unable to resolve host for approx 1 minute | Motorola Moto G - API 23 |
| wifi -> off -> wifi | Delayed | Works -> Unable to resolve (expected) -> Unable to resolve host until delayed reset called | Motorola Moto G - API 23 |

* Connection attempts made after each connectivity change.
** 5-second delay used in this example.

Example of the BroadcastReceiver class (context registered):

public class OurConnectivityChangeReceiver extends BroadcastReceiver
{
    private final GRPCChannelManager grpc_channel_manager;
    
    public OurConnectivityChangeReceiver(GRPCChannelManager grpc_channel_manager)
    {
        this.grpc_channel_manager = grpc_channel_manager;
    }
    
    @Override
    public void onReceive(Context context, Intent intent)
    {
        //optional delay instead of direct call
        /*new Handler().postDelayed(new Runnable()
        {
            @Override
            public void run()
            {
                grpc_channel_manager.ResetManagedChannel();
            }
        }, 5000);*/
        
        /*Added to check connection state. active_network_info.getState() is CONNECTED when 
        expected (wifi and mobile connections)*/
        ConnectivityManager connection_manager = (ConnectivityManager) context
                .getSystemService(Context.CONNECTIVITY_SERVICE);
        NetworkInfo active_network_info = connection_manager.getActiveNetworkInfo();
        
        //remove if optional delay used instead
        grpc_channel_manager.ResetManagedChannel();
        
        ...
    }
}

Example of the channel manager service:

public class GRPCChannelManagerService implements GRPCChannelManager
{
    ...

    @Override
    public synchronized ManagedChannel GetManagedChannel()
    {
        try
        {
            if (this.managed_channel != null)
            {
                return this.managed_channel;
            }
            this.managed_channel = OkHttpChannelBuilder
                    .forAddress(settings.host, settings.grpc_port)
                    .connectionSpec(ConnectionSpec.MODERN_TLS)
                    .sslSocketFactory(ssl_context_factory.CreateSslContext().getSocketFactory())
                    .intercept(new ClientInterceptorImpl(credentials))
                    .build();
            return this.managed_channel;
        }
        catch (Exception e)
        {
            //Exception handling
        }
    }
    
    @Override
    public synchronized void ResetManagedChannel()
    {
        if(this.managed_channel != null)
        {
            this.managed_channel.resetConnectBackoff();
        }
    }
}

@ericgribkoff
Contributor

Interesting. Thanks for the very detailed additional information! I was able to reproduce this issue on about 1 in 10 attempts (also on a Pixel XL @ API level 27) upon switching from wifi to mobile.

It seems that an immediate attempt at DNS resolution (via InetAddress.getAllByName) can fail on Android even if the BroadcastReceiver for the CONNECTIVITY_ACTION receives a network where isConnected() returns true. This can be observed without gRPC, just an attempt to resolve an InetAddress in response to the broadcast notification. This behavior is surprising, at least to me, given that the Javadoc on isConnected() states that a return value of true "Indicates whether network connectivity exists and it is possible to establish connections and pass data."
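
The observation above can be checked without gRPC. The following is a minimal sketch, not code from this thread: the class name DnsProbe and its parameters are hypothetical, and it only retries InetAddress.getAllByName a fixed number of times with a fixed pause. On Android it would need to run off the main thread, since network calls on the main thread throw NetworkOnMainThreadException.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper (not part of gRPC or Android): probes whether a host
// name can be resolved, retrying a few times with a fixed pause.
final class DnsProbe {
    private DnsProbe() {}

    /**
     * Returns true as soon as one resolution attempt succeeds, or false if
     * all maxAttempts attempts fail or the thread is interrupted.
     */
    static boolean resolvesWithin(String host, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                InetAddress.getAllByName(host);
                return true;
            } catch (UnknownHostException e) {
                // Resolution failed; pause before the next attempt, if any.
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
            }
        }
        return false;
    }
}
```

Calling something like DnsProbe.resolvesWithin(host, 5, 500) from a background thread right after the connectivity broadcast would show whether the first attempt fails while a later one succeeds.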

I'll look into this more tomorrow. I plan to have a PR out soon (hopefully this week) switching our DnsNameResolver to using exponential backoff (issue #3685). This would somewhat serve as an as-needed replacement for the manual delay you performed in your experiments - the first resolution attempt may still fail, but the initial backoff attempts would occur in short succession and the channel would reconnect quickly.
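
To illustrate why exponential backoff keeps the first few retries close together, here is a rough sketch of how such a delay schedule could be computed. The class BackoffSchedule and all of its parameters are assumptions made for illustration; this is not the DnsNameResolver implementation, and the values used in the usage note are not gRPC's actual settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of an exponential backoff schedule with jitter.
final class BackoffSchedule {
    private BackoffSchedule() {}

    /**
     * Returns the delay (in milliseconds) to wait before each of the given
     * number of retries. The base delay starts at initialMillis, grows by
     * multiplier after every retry and is capped at maxMillis. Each emitted
     * delay is randomized by up to +/- (jitter * base delay).
     */
    static List<Long> delaysMillis(int attempts, long initialMillis, double multiplier,
                                   long maxMillis, double jitter, Random random) {
        List<Long> delays = new ArrayList<>();
        double current = initialMillis;
        for (int i = 0; i < attempts; i++) {
            double spread = current * jitter;
            double jittered = current + (random.nextDouble() * 2 - 1) * spread;
            delays.add(Math.round(jittered));
            current = Math.min(current * multiplier, maxMillis);
        }
        return delays;
    }
}
```

With no jitter, an initial delay of 1000 ms, a multiplier of 2.0 and a cap of 5000 ms, the first four delays are 1000, 2000, 4000 and 5000 ms, so a momentary resolver failure is retried within a second or two rather than after a long fixed wait.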

So the exponential backoff on resolution should roughly serve as a stopgap solution to the problem reported here, but I'm curious if I'm misapplying the signals received from the OS about the connectivity state and the expectation that DNS resolution can succeed. API level 21 added a NetworkRequest API that allows for finer grained notifications about the network state and capabilities; it may be that something there will be the correct signal to reset the connection backoff and retry DNS resolution, but I'll need to investigate more.

@ericgribkoff
Contributor

I dug into this a bit more, and it is indeed the case that DNS resolution can/will sometimes fail even when the device reports its network status as connected. Other than differing network conditions, I'm not sure why @userar would be experiencing this failure so consistently, as it's far less frequent when I test it myself (attempting to resolve the gRPC interop server hostname).

The previously mentioned exponential backoff in gRPC's DNS resolution will mitigate this problem. We (gRPC) may also be able to behave a bit better when DNS resolution fails (as on the network switch) but we still have a previously known-good resolved address from before the connection change. I'll need to investigate if this is a viable approach - it may be that the network switch should automatically invalidate any previously resolved addresses.
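
The "keep the last known-good address" idea could look roughly like the sketch below. It is only an illustration under stated assumptions: LastKnownGoodResolver, its nested Lookup interface and the invalidate() method are hypothetical names, addresses are modelled as plain strings, and nothing here reflects how gRPC actually handles resolved addresses.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper: falls back to the last successful result for a host
// when a fresh lookup fails, until the cache is explicitly invalidated.
final class LastKnownGoodResolver {
    /** A lookup that may fail, e.g. with UnknownHostException. */
    interface Lookup {
        List<String> resolve(String host) throws Exception;
    }

    private final Lookup lookup;
    private final Map<String, List<String>> lastGood = new HashMap<>();

    LastKnownGoodResolver(Lookup lookup) {
        this.lookup = lookup;
    }

    /** Returns fresh addresses if available, else cached ones, else an empty list. */
    synchronized List<String> resolve(String host) {
        try {
            List<String> fresh = lookup.resolve(host);
            if (!fresh.isEmpty()) {
                lastGood.put(host, fresh);
                return fresh;
            }
        } catch (Exception e) {
            // Lookup failed; fall through to the cached result, if any.
        }
        List<String> cached = lastGood.get(host);
        return cached != null ? cached : Collections.<String>emptyList();
    }

    /** Drops all cached results, e.g. if a network switch should invalidate them. */
    synchronized void invalidate() {
        lastGood.clear();
    }
}
```

Whether a network switch should call something like invalidate() is exactly the open question raised above: addresses resolved on the old network may not be reachable on the new one.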

@userar
Author

userar commented Feb 2, 2018

Thanks for looking into this further. I have also been doing a bit more investigation. I now believe that I was getting consistent failures because of two separate issues. In the onReceive of our BroadcastReceiver I also had network binding commands (because the app communicates over wifi with devices that have no internet connection). Removing these bindings resolves the consistent failures, and I now only get the approximately 1 in 10 DNS resolution failures that you are also observing.

For now, I think I will write some work-around code in our onReceive while alternative solutions are investigated.

@zpencer
Contributor

zpencer commented Feb 28, 2018

@ericgribkoff is this one still applicable?

@ericgribkoff
Contributor

This will be resolved when #4105 is merged.

@ericgribkoff
Contributor

Update: #4105 changes the behavior of the gRPC library to use exponential backoff on DNS resolution failures. This fixes the issue reported here, as the channel will recover almost immediately from a momentary failure in the DNS resolver. This behavior change is automatic and doesn't require any user action to enable. The fix is in master now and will be in the upcoming gRPC Java 1.11.0 release.

lock bot locked as resolved and limited conversation to collaborators Sep 28, 2018