Cluster not entirely spun up (API Record Not Created) #859

Closed
petergarbers opened this Issue Nov 9, 2016 · 42 comments


@petergarbers

I’m trying to set up a cluster in a new AWS account using kops, following this guide

I have noticed that two records aren’t being created: api.clustername.domain.com and api.internal.clustername.domain.com. I only have the domain records for the etcd service.

As a result I am unable to connect to my cluster using kubectl.
From what I can tell the master and the other nodes are running; however, manually creating these domain records has been unfruitful, so I suspect there may be other issues.

@yissachar
Contributor

@justinsb @chrislovecnm I'm seeing a ton of reports with this issue in the past week on Slack. Anything changed recently that would cause this?

@petergarbers

I feel like I should mention that mine resolved itself after ~20 minutes but I'm leaving this open as other people are still seeing the issue

@shrabok-surge

I experienced this when I didn't have my NS records publishing the subdomain I was using. When the NS records were added the api records were created.

@tomdavidson

@shrabok-surge can you be more specific? Maybe even use the example from http://kubernetes.io/docs/getting-started-guides/kops/

When you say subdomain, do you mean useast1 or dev in useast1.dev.example.com?

@tomdavidson

Checking the Terraform output, there are no Route 53 resources. How are the cluster subdomains supposed to be created?

@juliendf

@petergarbers From a machine within the same VPC, are you able to resolve your domain : dig ns clustername.domain.com ?

@cyberroadie

@CliMz I have the same problem (no api domain names, and do have etcd names). I successfully resolved dig ns clustername.domain.com from a machine within the same VPC

@cyberroadie

I waited over an hour but still no api domain :-/

I ssh'd into the admin node and in /var/log/kube-apiserver.log I see:
controller.go:88] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured

Is this relevant?

@chrislovecnm
Member

@cyberroadie please add your install command, kops version, and AWS region.

@chrislovecnm chrislovecnm self-assigned this Nov 10, 2016
@cyberroadie

Solved: found out what the problem was: misconfiguration of the DNS subdomain. Logging into the master node and looking at the /var/log/etcd.log file, I could see the region.dev.xx.xx domain didn't get resolved. This prevented the etcd server from starting, and subsequently prevented the API server from starting because it couldn't connect to the etcd cluster.
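For anyone hitting the same symptom, the diagnosis above boils down to two checks from the master node. A sketch (the master IP, cluster domain, and exact etcd record names are placeholders; kops-created records usually follow the etcd-a.internal.<cluster> pattern, but verify against your own zone):

```shell
# SSH to the master (user and address depend on your image/setup)
ssh admin@<master-ip>

# Look for DNS failures in the etcd log
grep -iE "resolve|lookup" /var/log/etcd.log | tail

# Verify the etcd names resolve from inside the VPC;
# an empty answer here means the zone/delegation setup is broken
dig +short etcd-a.internal.clustername.domain.com
dig +short etcd-events-a.internal.clustername.domain.com
```

If the dig queries return nothing, the API server failure is downstream of the DNS problem, exactly as described above.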

@tomdavidson

@cyberroadie is this a bug, or is this a subdomain you configured for the cluster? For example, I have done exactly as described in http://kubernetes.io/docs/getting-started-guides/kops/ but have the same symptoms you had.

@cyberroadie

@tomdavidson It's not a bug. I made a mistake setting up the subdomain in Route 53.

@tomdavidson

@cyberroadie We are not the only ones so maybe we are all making the same mistake. Is there an issue with the steps in http://kubernetes.io/docs/getting-started-guides/kops/ ?

@cyberroadie

I can describe the steps I took. I was doing a test run of setting up a Kubernetes cluster. As described, if you're not in control of the main domain (e.g. testing.net) you can create a hosted zone for the subdomain (e.g. dev.testing.net). That will be the case in a future project, but for this test I added two hosted zones in Route 53 for a domain I control myself: one for testing.net and another for dev.testing.net.

This didn't work: resolving it with dig ns dev.testing.net returned the DNS servers of testing.net and couldn't find dev.testing.net. So for the test I dropped the dev.testing.net hosted zone and let everything be added to the testing.net hosted zone. I gave priority to testing the cluster first; I still have to figure out how to do the subdomain hosted zone. Re-reading the documentation now, I have to say I'm slightly puzzled about how to set up the NS records correctly in this scenario.

@cyberroadie

PS: the etcd domain names were added to the dev.testing.net hosted zone

@tomdavidson

Yes, I'm confused about the NS records too. In my case I delegated a zone to Route 53 (c.b.a.edu). Then with kops create I used a name such as tom.c.b.a.edu. etcd records were created for tom.c.b.a.edu, but nothing else.

@chrislovecnm
Member

Can I get a status on where this issue is at? Not following the comments ;)

@tomdavidson

@chrislovecnm
I have not been able to confirm the NS setup is configured as kops needs. I have done exactly as described in http://kubernetes.io/docs/getting-started-guides/kops/ but the direction is not clear.

This is potentially all user error / unclear docs, but until we can clarify the needed config we cannot verify it is solely user error.

@chrislovecnm
Member

Take a look at this http://blog.couchbase.com/2016/november/multimaster-kubernetes-cluster-amazon-kops

We have an issue to drop something like this into our docs

@cyberroadie

I'm going to have a play around this weekend to see if I can create more clarity. For the project I'm currently working on it would be ideal if every developer team has control over its own subdomain and the main domain is controlled separately. (Both in Route 53.)

Test scenario:

  • parent domain in a hosted zone in route53
  • sub domains in separate hosted zones, also in route53

So far we know:

  • with a kops install the etcd domains get added to the subdomain hosted zone
  • when a node tries to resolve a etcd domain it goes to the parent domain hosted zone (ignoring the subdomain hosted zone)

Acceptance criteria

  • VM within the VPC setup by kops resolves the etcd domain name

Outcome:

  • Step by step documentation on how to do this

PS You can find me on slack: #kubernetes-users

@tomdavidson

Deleted my Route 53 zones and created new ones. This time the api record was created. FYI, the default instance limit on a new AWS account kept my autoscaling group from populating; mentioning it in case there is a common problems section in the new docs.

@chrislovecnm
Member

@tomdavidson there is a troubleshooting guide, feel free to update

@cyberroadie
cyberroadie commented Nov 19, 2016 edited

Update:
So I did my test scenario and here are the results:
I will use example.com as an example domain:

  1. Created two separate hosted zones in AWS Route53:
    One for example.com and one for dev.example.com

  2. Setting up 'route delegation' (this is the proper name for it):
    Copy the four nameservers from the NS record of dev.example.com and create a new NS record in example.com containing these subdomain nameservers. Give dev.example.com as the name of the new NS record. After this is done, the parent domain (example.com) will delegate all requests for *.dev.example.com to the correct hosted zone in Route 53. Also, if you create a new domain (e.g. test.dev.example.com) via the AWS Route 53 command-line tool, it will be added to the subdomain hosted zone.

  3. After this you can set up Kubernetes (with kops), and the new domain names (for etcd, etcd-events, etc.) will be added to the dev.example.com hosted zone.

The advantage of this is that you can hand over the control of a subdomain to another team without losing control over your parent domain.

Regarding the documentation, I think it would be good to add 'delegating DNS requests to a subdomain', and that in order to do that you have to create a separate NS record in the parent domain's hosted zone with the nameservers of the subdomain's hosted zone.

One observation: with a setup like this, all new subdomains were added almost instantly; I never had to wait more than a minute to see them appear in the subdomain hosted zone.
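For anyone scripting the delegation step described above, here is a sketch using the AWS CLI (zone IDs and the dev.example.com name are placeholders; requires jq):

```shell
# Hosted zone IDs for the parent and subdomain zones (placeholders)
PARENT_ZONE_ID=Z111PARENT
SUB_ZONE_ID=Z222SUB

# Grab the four nameservers AWS assigned to the subdomain's zone
NS_JSON=$(aws route53 get-hosted-zone --id "$SUB_ZONE_ID" \
  --query 'DelegationSet.NameServers' --output json)

# Create an NS record for dev.example.com inside the PARENT zone,
# pointing at the subdomain zone's nameservers
aws route53 change-resource-record-sets --hosted-zone-id "$PARENT_ZONE_ID" \
  --change-batch "{
    \"Changes\": [{
      \"Action\": \"UPSERT\",
      \"ResourceRecordSet\": {
        \"Name\": \"dev.example.com\",
        \"Type\": \"NS\",
        \"TTL\": 300,
        \"ResourceRecords\": $(echo "$NS_JSON" | jq '[.[] | {Value: .}]')
      }
    }]
  }"
```

This mirrors the console steps in point 2: the record in the parent zone must carry the subdomain zone's own nameservers, not the parent's.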

@MichaelJCole MichaelJCole added a commit to MichaelJCole/kops that referenced this issue Nov 19, 2016
@MichaelJCole MichaelJCole De-duplicate documentation
There are two versions of this documentation.  Neither describe how to make a separate zone correctly on AWS, and this causes issues like: kubernetes#859
d73d4cf
@MichaelJCole
MichaelJCole commented Nov 19, 2016 edited

Set out to make a PR to fix this. 27 bot-mails, 20 mins of signups, email confirmations, linking accounts, and a contract that needed my address. I've declined to work that hard to give you free work as a PR.

You may find this interesting on configuring Route53 subdomains

So, I had this problem, and I can verify my root cause and the fix:

What happened: I followed this and everything worked up until:

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.cluster.stage.example.io on 127.0.1.1:53: no such host

Why? The api record wasn't being created, as described above.

Root cause: my DNS wasn't configured correctly. I had a parent domain example.io and a subdomain stage.example.io.

Fix: add an NS record to the parent domain for the subdomain, with the subdomain's NS servers, as described in the article above.
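A quick way to confirm the delegation took effect (domain names taken from the example above):

```shell
# Ask which nameservers answer for the subdomain. With correct
# delegation this returns the subdomain zone's four NS hosts,
# not the parent zone's.
dig NS stage.example.io +short

# Once the cluster is up, the api record should resolve too
dig +short api.cluster.stage.example.io
```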

Thanks for the awesome tool :-)

@chrislovecnm chrislovecnm added this to the 1.4.6 milestone Nov 23, 2016
@chrislovecnm chrislovecnm modified the milestone: 1.4.6, 1.4.2 Nov 28, 2016
@kris-nova kris-nova assigned justinsb and unassigned chrislovecnm and zmerlynn Nov 28, 2016
@ndtreviv

For people coming back to this issue: We did everything that @MichaelJCole did in advance of creating our clusters (i.e. created NS records with the subdomain NSs in the root domain's hosted zone), and it still took about 20 minutes for everything to come up.

It took a good while for the api* routes to be created, and even then took a while for the DNS records to propagate. kubectl get nodes was returning no such host all the time, then was successful 1 in every 4 times (note: there were 4 name servers), then worked more regularly, then eventually worked every time.

So, be aware:

  1. When you bring your cluster up, it takes a while for the api* A records to be created
  2. When the api* A records are created, it doesn't mean kubectl will work instantly
  3. When kubectl does start working, it may seemingly be intermittent whilst the DNS propagates
  4. Eventually, everything will be fine. Kick back, cool off, (open|pour) a (cold|hot) one
  5. This happens even if you're bringing up a second cluster on the same hosted zone (eg: hosted zone = k8s.mydomain.com; one cluster already exists at: useast1.user1.dev.k8s.mydomain.com; second cluster created at: useast.user2.dev.k8s.mydomain.com)
@kris-nova kris-nova changed the title from Cluster not entirely spun up to Cluster not entirely spun up (API Record Not Created) Dec 11, 2016
@deitch
deitch commented Dec 19, 2016

Glad to see I am not the only one with this issue. It did take a while.

How does it create the Route 53 record? I ran kops --target terraform followed by terraform apply, so everything was created via Terraform. Yet there is no Route 53 resource anywhere in kubernetes.tf or the data/ dir.

@jaygorrell

The masters manage the DNS; Terraform wouldn't know how to manage the IPs of instances that may get replaced.

@deitch
deitch commented Dec 19, 2016

@jaygorrell the k8s master creates the Route 53 entry for api.subdom.mydomain.com? I thought that was set up at creation time by Terraform. Is that why it sometimes takes a while, as opposed to being immediate (which would be the case if Terraform had done it)?

I begin to understand. :-)

2 questions:

  1. Are there k8s docs anywhere about creating the route53 entry?
  2. Since it knows how to integrate with route53, is there any reason it cannot be configured to create a CNAME entry for an ELB when it creates a Service with type=LoadBalancer?
@jaygorrell

Terraform wouldn't know the IPs before they're assigned to those instances, and the api record is a round-robin list of each IP - not a CNAME to an ELB or anything.

  1. There's a little bit on that here, that may give you terminology to dig deeper if you'd like: https://github.com/kubernetes/kops/blob/master/docs/boot-sequence.md#api-server-bringup
  2. That's exactly what https://github.com/Vungle/kube-route53 does -- I'm not sure if anything could be added to kops directly to support that or not, though.
@deitch
deitch commented Dec 19, 2016

Terraform wouldn't know the IP before it's assigned to those instances

Got that. It was the api. address that I was confused about.

api is a RR list of each IP

each IP? Isn't it a single one? Oh, you mean multiple masters? OK, got that. Makes sense.

There's a little bit on that here
Thanks, that does help.

That's exactly what https://github.com/Vungle/kube-route53 does
Does it? I am looking specifically for the ELB when a service is type=LoadBalancer. It is that service that is most likely to be exposed to the outside world.

@jaygorrell

Sorry, I failed at Google. Meant to link this one:
https://github.com/wearemolecule/route53-kubernetes

@deitch
deitch commented Dec 19, 2016

Oooh, now that is interesting. Thanks @jaygorrell!

@chrislovecnm
Member

@jaygorrell that is also a lot of what dns-controller does :) You already have that installed on a kops cluster.

@jaygorrell

Ah yes - didn't realize there was a kops release a few days ago... been waiting on that one!

So this should work now, yes?
https://github.com/kubernetes/kops/tree/master/dns-controller

@deitch
deitch commented Dec 21, 2016

Damn! I accidentally clicked close on this tab and it lost my comment. Don't know what GitHub does to prevent the browser from recognizing that there is entered text, but it is not wise!

OK, recreating:

As far as I can tell, it looks like:

  • route53-kubernetes creates a CNAME for Service of type=LoadBalancer to point to the auto-generated hostname of the ELB.
  • dns-controller creates multiple A records, one for each node on which the given Service has a Pod running, which is useful for a Service of type=NodePort.

Is that right? If so, I would love to try it.
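If dns-controller is the route taken, the hookup is annotation-driven rather than a separate deployment. A sketch of a Service carrying dns-controller's external-DNS annotation (the annotation key and hostname here are my understanding of dns-controller's interface; check its docs before relying on this):

```shell
# Apply a NodePort Service that dns-controller should publish in Route 53
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    dns.alpha.kubernetes.io/external: "my-app.dev.example.com"
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80
EOF
```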

@justinsb justinsb removed the area/DNS label Dec 21, 2016
@chrislovecnm
Member

@deitch looking to see if there is an issue open for better documentation.

@deitch
deitch commented Dec 21, 2016

Thanks @chrislovecnm

@chrislovecnm
Member

#1230 <- lets talk there

@justinsb justinsb modified the milestone: 1.4.4, 1.5.0 Dec 28, 2016
@pl1ght
pl1ght commented Jan 2, 2017

Initially same issue here. What it boils down to is something I'm betting a LOT of us overlooked: when you create the initial hosted zone for your subdomain, you get a DIFFERENT SET of NS records from AWS than your parent domain has. I initially copied and pasted the same NS records from my parent domain into my subdomain's NS record in the parent hosted zone. Then I deleted that, and copied and pasted the DIFFERENT NS records from my subdomain's hosted zone into the parent hosted zone's NS record for my subdomain name. That fixed the missing records instantly. Just re-ran update cluster --yes and voila!

@chrislovecnm
Member

So can we close this?
