
fix: Use a FQDN for metadata.google.internal #117

Merged · 10 commits · Dec 11, 2018

Conversation

@crwilcox (Contributor) commented Oct 31, 2018

Minimize lookups by not expanding the search path when resolving metadata.google.internal.

A customer found that about 85 DNS requests were made at init of packages relying on gcp-metadata. Upon investigation, the issue was that we were triggering search-path expansion during DNS resolution. After this change, about 21 requests are made.
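For context, a minimal sketch of the idea (illustrative, not the actual diff): appending a trailing dot makes the hostname fully qualified, so the OS resolver skips search-path expansion entirely.

// The trailing dot marks the name as fully qualified, so the resolver
// will not append any search-path domains before looking it up.
const HOST = 'metadata.google.internal.';
const url = `http://${HOST}/computeMetadata/v1/instance/id`;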

@googlebot added the cla: yes label Oct 31, 2018
@JustinBeckwith added the do not merge label Oct 31, 2018
@wyardley commented Oct 31, 2018

It does result in slightly odd-looking requests, but I think it [appending the trailing dot] is the correct behavior; otherwise the resolver will first try metadata.google.internal prepended to every domain in the search path, e.g. (from tcpdump):

22:02:20.530118 IP 10.48.2.14.49662 > 10.51.240.10.53: 62251+ AAAA? metadata.google.internal.default.svc.cluster.local. (68)
22:02:20.530633 IP 10.48.2.14.52955 > 10.51.240.10.53: 49598+ A? metadata.google.internal.svc.cluster.local. (60)
22:02:20.530657 IP 10.48.2.14.52955 > 10.51.240.10.53: 49950+ AAAA? metadata.google.internal.svc.cluster.local. (60)
22:02:20.530957 IP 10.48.2.14.40507 > 10.51.240.10.53: 45306+ A? metadata.google.internal.cluster.local. (56)
22:02:20.530984 IP 10.48.2.14.40507 > 10.51.240.10.53: 45635+ AAAA? metadata.google.internal.cluster.local. (56)
22:02:20.531301 IP 10.48.2.14.50796 > 10.51.240.10.53: 62724+ A? metadata.google.internal.c.project-name.internal. (67)
22:02:20.531330 IP 10.48.2.14.50796 > 10.51.240.10.53: 63148+ AAAA? metadata.google.internal.c.project-name.internal. (67)
22:02:20.535246 IP 10.48.2.14.53737 > 10.51.240.10.53: 1695+ A? metadata.google.internal.google.internal. (58)
22:02:20.535284 IP 10.48.2.14.53737 > 10.51.240.10.53: 2075+ AAAA? metadata.google.internal.google.internal. (58)

The quad-A (AAAA) lookups also multiply the total number of lookups. Ideally, the result could be resolved once and then cached in memory, but this should be a big improvement.

In normal cases, this might not be so bad, but lots of apps on lots of pods running on GKE, with lots of domains in the (default, not trivially changeable) search path, results in a pretty heavy load on kube-dns / dnsmasq, which as of now has no caching layer.

@JustinBeckwith (Contributor) commented

Thanks for the patience, folks. Full disclosure: we are waiting on a DNS lookup fix to land in Google Cloud Functions before we can merge this in.

@JustinBeckwith (Contributor) left a comment

Need to wait for GCF fix to land

@JustinBeckwith added the kokoro:force-run label and removed the do not merge and kokoro:force-run labels Nov 19, 2018
@kokoro-team removed the kokoro:force-run label Nov 19, 2018
@JustinBeckwith (Contributor) commented

@crwilcox @ofrobots the system test did exactly what we wanted it to! 💃 I've never been so excited to see a failing test.

@stephenplusplus (Contributor) commented

We did this elsewhere: googleapis/google-cloud-node#2214, and we had to revert it because of this: googleapis/google-cloud-node#2249.

@ofrobots (Contributor) commented

After reviewing the above, I am forced to conclude that software is terrible. 😭

Let's just use an IP address and avoid DNS resolution altogether. I believe the Python client libs already use IPs.

@JustinBeckwith JustinBeckwith changed the title Use a FQDN for metadata.google.internal fix: Use IP address instead of metadata.google.internal Nov 21, 2018
@JustinBeckwith (Contributor) left a comment

Handing this over to someone else to approve :)

@ofrobots (Contributor) left a comment

This is going to be a bit more complicated, unfortunately. I should have thought of this earlier, but given the test:

const request = require('request');

const IP = '169.254.169.254';

const host = process.argv[2] || IP;
const url = `http://${host}/computeMetadata/v1/instance/id`;
console.log('requesting', url);
request(url, (err, res) => {
  // Only the failure mode matters for this experiment, so just dump the error.
  console.dir(err);
});

Consider the output:

time node ip.js metadata.google.internal
requesting http://metadata.google.internal/computeMetadata/v1/instance/id
{ Error: getaddrinfo ENOTFOUND metadata.google.internal metadata.google.internal:80
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:58:26)
  errno: 'ENOTFOUND',
  code: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'metadata.google.internal',
  host: 'metadata.google.internal',
  port: 80 }
node ip.js metadata.google.internal  0.22s user 0.05s system 97% cpu 0.272 total

~/tmp/foo
❯ time node ip.js
requesting http://169.254.169.254/computeMetadata/v1/instance/id
{ Error: connect ETIMEDOUT 169.254.169.254:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1117:14)
  errno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '169.254.169.254',
  port: 80 }
node ip.js  0.20s user 0.05s system 0% cpu 1:15.74 total

The good thing about the DNS lookup was that it failed before being subject to the OS-controlled TCP timeouts, not to mention retries.

@ofrobots (Contributor) commented

We need another fast way to detect if an environment is GCP before attempting the TCP connection.

@ofrobots (Contributor) commented Nov 21, 2018

The solution will need to achieve these attributes:

  • We cannot use a trailing dot, because software is terrible (we get Java IllegalArgumentExceptions on Android, as per "@google-cloud/storage file getSignedUrl returns trailing dot in URL", googleapis/google-cloud-node#2249).
  • Regardless of whether we are running on GCP or not, we should be able to discover the presence (or absence) of the metadata service fast (~seconds, ideally milliseconds).
  • We should not make repeated requests to resolve addresses (i.e. use DNS caching).

Concurrently:

  1. Try to resolve metadata.google.internal (no trailing dot).
  2. Open a TCP socket to 169.254.169.254. If the socket hasn't connected in N seconds, treat this as a failure to connect. N is a small number, say 5. (Is there an SLA for the metadata service response time?)

Whichever operation finishes first, cache the result as the hostname of the metadata service and use that going forward.

When running on GCP, step 2 will finish in milliseconds. When running off GCP, step 2 will fail, so the operation latency will be dependent on the first check passing. This adds no additional latency to our current behavior in the off-GCP environment.
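A minimal sketch of that race (hypothetical helper names, not this library's actual code): resolve the hostname and connect to the IP concurrently; the first success wins and is cached for every later request.

const dns = require('dns');
const net = require('net');

const HOST = 'metadata.google.internal';
const IP = '169.254.169.254';

let cachedHost; // pay the detection cost only once per process

function resolveHost() {
  return new Promise((resolve, reject) => {
    // dns.lookup goes through getaddrinfo, so the search path applies here.
    dns.lookup(HOST, err => (err ? reject(err) : resolve(HOST)));
  });
}

function connectToIp(timeoutMs) {
  return new Promise((resolve, reject) => {
    const socket = net.connect(80, IP);
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => { socket.destroy(); resolve(IP); });
    socket.once('timeout', () => { socket.destroy(); reject(new Error('connect timed out')); });
    socket.once('error', reject);
  });
}

// Promise.race would reject as soon as either attempt fails, so take the
// first *success* instead, and fail only when both attempts have failed.
function firstSuccess(promises) {
  return new Promise((resolve, reject) => {
    let failures = 0;
    promises.forEach(p => p.then(resolve, () => {
      if (++failures === promises.length) reject(new Error('no metadata server detected'));
    }));
  });
}

async function getMetadataHost() {
  if (!cachedHost) {
    cachedHost = await firstSuccess([resolveHost(), connectToIp(5000)]);
  }
  return cachedHost;
}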

WDYT?

@theacodes commented

Is there an SLA for the metadata service response time?

The metadata service has absolutely no SLA.

@wyardley commented

Try to resolve metadata.google.internal (no trailing dot).

Resolving it without the trailing dot will result in N× the number of DNS requests, one set for each entry in the OS resolver library's search path, before trying it as a FQDN. My suspicion is that resolving it should work on all platforms; it's actually using it in the Host: header that's probably causing a problem.
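To illustrate the distinction, a hypothetical sketch (not a tested or proposed fix): connect using the dotted FQDN, but present the undotted name in the Host header, since the header is all the remote software ever sees.

const http = require('http');

const req = http.get({
  host: 'metadata.google.internal.', // used only for DNS resolution
  path: '/computeMetadata/v1/instance/id',
  headers: {
    Host: 'metadata.google.internal', // what the server actually sees
    'Metadata-Flavor': 'Google',
  },
}, res => {
  res.resume();
  console.log('status:', res.statusCode);
});
req.on('error', console.error);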

@ofrobots (Contributor) commented

Actually, as worded, my strategy doesn't depend on the SLA of the metadata service at all (any more than it does today). The concurrent direct connection to the IP is merely a fast path.

@wyardley

Resolving it without the trailing dot will result in Nx # of DNS requests for each entry in the OS’s resolver library search path before trying it as a FQDN

In GCP environments, the IP connect will finish first, which means that these N DNS queries will be initiated but the result will be inconsequential.

@wyardley commented Nov 21, 2018

@ofrobots in GKE (where there are ~6 domains in the search path and the resolver makes both A and quad-A lookups), this will hit kube-dns heavily, which is the whole thing this is trying to fix, so I wouldn't say it's inconsequential. Also, wouldn't deciding too quickly that we're not in GCP potentially cause auth failures?

If it's only for the isGCP check, the number of lookups shouldn't matter as much; and if it's just a DNS lookup, rather than an HTTP request, the trailing dot shouldn't cause problems.

@ofrobots (Contributor) commented

Also, wouldn’t deciding too quickly that we’re not in GCP potentially could cause auth failures, no?

We don't decide that we are not on GCP until both the IP connect and the DNS lookups fail, so no: from an auth point of view, my proposal is the same as the current behavior.

in GKE (where there are ~ 6 domains in the search path and the resolver makes both A and quad-A lookups), this will hit kube-dns heavily which is the whole thing this is trying to fix, so I wouldn’t say it’s inconsequential.

To make sure I understand: is it the 6 lookups that are the problem, or the fact that there was no cache of the result?

My suspicion is that resolving it should work on all platforms; it’s actually using it in the Host: header that’s probably causing a problem.

I think we need more definitive information here. Anyone with Java chops willing to try OkHttp (possibly on Android)?

@wyardley commented

I think it'll be 12-14 lookups for each request or so, because there are both the A and quad-A lookups. @crwilcox has a POC I threw together that generates about 80 or so DNS lookups by connecting to a handful of Google services in GKE using the official libs.

Yes, once GKE rolls out better caching in front of dnsmasq / kube-dns, this should be less of a problem. I don't think that will be rolled out 'til sometime next year?

@ofrobots (Contributor) commented

I'm also proposing we cache the resolution result locally in this module – that is, this library will try to resolve metadata.google.internal only once (which will generate N DNS requests based on the DNS search path). However, once resolved, we would not see any more lookups regardless of whether kube-dns has caching. Would that address your concerns?

@ofrobots (Contributor) commented

As a further optimization, perhaps we can limit to A lookups only.
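A quick sketch of that (assuming Node's dns module; not this library's actual code):

const dns = require('dns');

// Asking getaddrinfo for IPv4 only avoids the parallel AAAA queries.
dns.lookup('metadata.google.internal', { family: 4 }, (err, address) => {
  if (err) throw err;
  console.log('A record:', address);
});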

@wyardley commented

Yes, resolving it once should definitely be a massive improvement, and would also be cleaner than hard-coding the IP.

I’m pretty confident, though, that if you’re using that approach, it should be safe (and most correct) to use the trailing dot as well.

@ofrobots (Contributor) commented

My suspicion is that resolving it should work on all platforms; it’s actually using it in the Host: header that’s probably causing a problem.

Staring at the Java exception a bit more, I think you're right. The error is an SSL handshake error, which implies that the DNS lookup was already done. It would be good to get something more definitive, though.

@JustinBeckwith JustinBeckwith changed the title fix: Use IP address instead of metadata.google.internal fix: Use a FQDN for metadata.google.internal Nov 26, 2018
@wyardley commented Nov 30, 2018

Some really interesting points are made in these links, pointed out to me by Google support:

nodejs/node#15780 (comment)
which then links to:
https://github.com/gliderlabs/docker-alpine/blob/master/docs/caveats.md
https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html

These discuss some of the finer points of why this can cause such headaches with GKE, especially with Alpine base images. I'm not sure if there's been any discussion about changing options ndots:5 to 1 or 2 in the GKE nodes' /etc/resolv.conf, but my quick read is that this would tend to mitigate some of these problems.

For me, adding this to the Kube deployment spec "fixes" the problem (with ndots: 2, a name containing two or more dots, like metadata.google.internal, is tried as an absolute name before the search path is applied):

      dnsConfig:
        options:
          - name: ndots
            value: "2"

@JustinBeckwith (Contributor) commented

It looks like this actually got fixed in GCF! @ofrobots mind taking another look?

@ofrobots (Contributor) left a comment

This change LGTM, as it will not cause googleapis/google-cloud-node#2249 to come back: that issue is related to service endpoint hostnames being used with the trailing dot in the headers.

In other libraries, I expect we will want to add a trailing dot too, but that needs to be done with more care.

@kalyanac commented Jun 6, 2019

We need another fast way to detect if an environment is GCP before attempting the TCP connection.

FWIW, here is how Go handles it:
https://github.com/googleapis/google-cloud-go/blob/master/compute/metadata/metadata.go#L119
