
Cannot use terraform and gossip-based cluster at the same time #2990

Open
simnalamburt opened this issue Jul 18, 2017 · 36 comments

Comments

@simnalamburt

commented Jul 18, 2017

If you create a cluster with both the terraform and gossip options enabled, all kubectl commands will fail.


How to reproduce the error

My environment

$ uname -a
Darwin *****.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64

$ kops version
Version 1.6.2

$ terraform version
Terraform v0.9.11

$ aws --version
aws-cli/1.11.117 Python/2.7.10 Darwin/16.6.0 botocore/1.5.80

Setting up the cluster

# Create RSA key
ssh-keygen -f shared_rsa -N ""

# Create S3 bucket
aws s3api create-bucket \
  --bucket=kops-temp \
  --region=ap-northeast-1 \
  --create-bucket-configuration LocationConstraint=ap-northeast-1

# Generate the Terraform code; some resources,
# including *certificates*, will be stored in S3
kops create cluster \
  --name=kops-temp.k8s.local \
  --state=s3://kops-temp \
  --zones=ap-northeast-1a,ap-northeast-1c \
  --ssh-public-key=./shared_rsa.pub \
  --out=. \
  --target=terraform

# Create cluster
terraform init
terraform plan -out ./create-cluster.plan
terraform show ./create-cluster.plan | less -R # final review
terraform apply ./create-cluster.plan # fire

# Done

Spoiler alert: creating the self-signed certificate before creating the actual Kubernetes cluster is the root cause of this issue. Read on to see why.

Scenario 1. Looking up non-existent domain

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.kops-temp.k8s.local on 8.8.8.8:53: no such host

This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform and gossip options enabled, you'll get a wrong ~/.kube/config file.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api.kops-temp.k8s.local
            # !!!! There's no such domain named "api.kops-temp.k8s.local"
  name: kops-temp.k8s.local
# ...

Let's manually correct that file. Alternatively, you'll get a good config file if you explicitly export the configuration once again.

kops export kubecfg kops-temp.k8s.local --state s3://kops-temp

Then the non-existent domain will be replaced with the DNS name of the master nodes' ELB.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
  name: kops-temp.k8s.local
# ...

And you'll end up in scenario 2 when you retry.

Scenario 2. Invalid certificate

$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for api.internal.kops-temp.k8s.local, api.kops-temp.k8s.local, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, not api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com

This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform option enabled. If you create the cluster with only the gossip option, without the terraform target, the self-signed certificate properly contains the DNS name of the ELB.

[Screenshot: the certificate's list of DNS subject alternative names; the UI is in Korean]
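For reference, a quick way to check which names the served certificate actually covers, sketched with the placeholder ELB hostname used earlier in this report:

openssl s_client -connect api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com:443 </dev/null 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'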

The only way to work around this problem is to force "api.kops-temp.k8s.local" to resolve to the proper IP address by manually editing /etc/hosts, which is undesirable for many people.

# Recover ~/.kube/config
perl -i -pe \
    's|api-kops-temp-k8s-local-nrvnqsr-666666\.ap-northeast-1\.elb\.amazonaws\.com|api.kops-temp.k8s.local|g' \
    ~/.kube/config

# Hack /etc/hosts
host api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com |
    perl -pe 's|^.* address (.*)$|\1\tapi.kops-temp.k8s.local|g' |
    sudo tee -a /etc/hosts

# This will succeed
kubectl get nodes

I'm not very familiar with kops internals, but I expect it will take a big change to properly fix this issue. Maybe using AWS Certificate Manager could be a solution (#834). Any ideas?

@simnalamburt simnalamburt changed the title Cannot use terraform and gossip-based cluster in the same time Cannot use terraform and gossip-based cluster at the same time Jul 18, 2017

@gregd72002


commented Oct 4, 2017

I can reproduce the problem using kops 1.7.5

@pastjean

Contributor

commented Oct 10, 2017

If you run kops update cluster $NAME --target=terraform after the terraform apply, it will actually generate a new certificate. Run kops export kubecfg $NAME after that and you've got a working setup. Although, I know, it's not very straightforward.
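For reference, a rough sketch of that sequence, using the state bucket and cluster name from the original report (adjust to your own values):

kops update cluster kops-temp.k8s.local --state=s3://kops-temp --target=terraform --out=.
kops export kubecfg kops-temp.k8s.local --state=s3://kops-temp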

@thedonvaughn


commented Oct 21, 2017

I also had the same reported issue. I took @pastjean's advice and re-ran kops update cluster $NAME --target=terraform and then kops export kubecfg $NAME. While this updated my kube config with the proper DNS name of the API ELB, I still have an invalid cert error.

@thedonvaughn


commented Oct 21, 2017

Never mind. I have to create the cluster with --target=terraform first. After running terraform apply and then updating, I get a new master cert. I was creating the cluster, then updating with --target=terraform, then applying, then re-running the update; this didn't generate a new cert. So my bad on the order. Issue is resolved. Thanks.
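For reference, the working order described above, sketched with the values from the original report (same bucket, zones, and SSH key as earlier; adjust as needed):

kops create cluster --name=kops-temp.k8s.local --state=s3://kops-temp \
    --zones=ap-northeast-1a,ap-northeast-1c --ssh-public-key=./shared_rsa.pub \
    --target=terraform --out=.
terraform init && terraform apply
kops update cluster kops-temp.k8s.local --state=s3://kops-temp --target=terraform --out=.
kops export kubecfg kops-temp.k8s.local --state=s3://kops-temp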

@chrislovecnm

Member

commented Oct 22, 2017

Closing!

@sybeck2k


commented Nov 6, 2017

The bug is still valid for me, and @pastjean's solution is not working for me. I'm using an S3 remote store; here are my versions:

$ uname -a
Darwin xxxxx 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64
$ kops version
Version 1.7.1
$ terraform version
Terraform v0.10.8
$ aws --version
aws-cli/1.11.137 Python/2.7.10 Darwin/17.0.0 botocore/1.6.4

To reproduce, I follow the same steps @simnalamburt reported. I then run kops update cluster $NAME --target=terraform --out=. and terraform apply, but I still have an invalid certificate (it does not get the alias of the AWS LB).

Checking the S3 store, in the folder <cluster-name>/pki/issued/master, I can see that a first certificate is created when creating the cluster with kops, and a second is added after the kops update request. The second certificate does include the LB DNS name, but it is not deployed onto the master node(s).
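For reference, a hedged sketch of listing those issued certificates yourself (bucket and cluster names are placeholders):

aws s3 ls s3://<state-bucket>/<cluster-name>/pki/issued/master/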

Here is the update command output:

kops update cluster $NAME --target=terraform --out=.
I1106 18:10:54.285184    8239 apply_cluster.go:420] Gossip DNS: skipping DNS validation
I1106 18:10:55.044907    8239 executor.go:91] Tasks: 0 done / 83 total; 38 can run
I1106 18:10:55.467860    8239 executor.go:91] Tasks: 38 done / 83 total; 15 can run
I1106 18:10:55.469345    8239 executor.go:91] Tasks: 53 done / 83 total; 22 can run
I1106 18:10:56.032321    8239 executor.go:91] Tasks: 75 done / 83 total; 5 can run
I1106 18:10:56.691785    8239 vfs_castore.go:422] Issuing new certificate: "master"
I1106 18:10:57.160535    8239 executor.go:91] Tasks: 80 done / 83 total; 3 can run
I1106 18:10:57.160867    8239 executor.go:91] Tasks: 83 done / 83 total; 0 can run
I1106 18:10:57.261829    8239 target.go:269] Terraform output is in .
I1106 18:10:57.529372    8239 update_cluster.go:247] Exporting kubecfg for cluster
Kops has set your kubectl context to ci5-test.k8s.local

Terraform output has been placed into .

Changes may require instances to restart: kops rolling-update cluster

As you can see, the log reports that the certificate is generated. I've tried doing a kops rolling-update cluster --cloudonly as recommended, but the output is No rolling-update required.

@jlaswell

Contributor

commented Nov 6, 2017

@sybeck2k, we have also experienced this issue as of a few hours ago.

You will need to run kops rolling-update cluster --cloudonly --force --yes to force an update. This can take a while depending on the size of the cluster, but we have found that trying to manually set --master-interval or --node-interval can prevent nodes from reaching a Ready state. I suggest just grabbing some ☕️ and letting the default interval do its thing.

It is still just a workaround at the moment, but we have found it to be repeatably successful.

@chrislovecnm

Member

commented Nov 6, 2017

This should be fixed in master, if someone wants to test master or wait for the 1.8 beta release.

@sybeck2k


commented Nov 7, 2017

@jlaswell thanks a lot! I can confirm your workaround works for kops 1.7.1.
Could anyone point me to the details of what exactly is pulled from the state store, and when? In the doc I found this information

@jlaswell

Contributor

commented Nov 7, 2017

Not sure about what is used when. I would bet that looking through some of the source code is best for that, but I do know that you can look in the S3 bucket used for the state store if you are using AWS. We've perused that a few times to get an understanding.

@shashanktomar

Contributor

commented Nov 12, 2017

@chrislovecnm I can still reproduce this in 1.8.0-beta.1. Both steps are still required:

  • kops update cluster $NAME --target=terraform --out=.
  • kops rolling-update cluster --cloudonly --force --yes

@chrislovecnm chrislovecnm reopened this Nov 12, 2017

@chrislovecnm

Member

commented Nov 12, 2017

@shashanktomar I would assume the workflow is

  1. kops update cluster --target=terraform
  2. terraform apply (not sure the syntax is correct)
  3. kops rolling-update cluster

What does rolling update show?

It would be a bug if the update does not create the same hash in the tf code that we create in the direct-target code path.

@andresguisado


commented Nov 15, 2017

@chrislovecnm I can reproduce this in 1.8.0-beta.1 as well. As @shashanktomar said, both steps are still required:

  • kops update cluster $NAME --state s3://bucket --target=terraform --out=.
  • kops rolling-update cluster --cloudonly --force --yes

Here is the rolling update output:

Using cluster from kubectl context: dev.xxx.k8s.local

NAME			STATUS	NEEDUPDATE	READY	MIN	MAX
master-eu-west-2a	Ready	0		1	1	1
nodes			Ready	0		2	2	2
W1115 15:28:50.519884   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:28:50.519898   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "master-eu-west-2a.masters.dev.xxx.k8s.local".
 
 
W1115 15:33:50.723093   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:33:50.723189   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:33:50.723203   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:35:50.930041   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:35:50.930978   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:35:50.931003   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:37:51.117159   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
I1115 15:37:51.117407   16811 rollingupdate.go:174] Rolling update completed!
@tspacek

Contributor

commented Dec 20, 2017

I reproduced this in 1.8.0 after kops create cluster ... --target=terraform and terraform apply

I can confirm that running the following fixed it:
kops update cluster $NAME --target=terraform
kops rolling-update cluster $NAME --cloudonly --force --yes

@chrislovecnm

Member

commented Dec 24, 2017

More detail please

@bashims


commented Feb 5, 2018

I am having the same problem here (see version info below); the workaround does indeed work, but it takes way too long to complete. It would be great if this could be resolved.

kops version

Version 1.8.0 (git-5099bc5)      
@mbolek


commented Mar 6, 2018

As above, this is still broken in:
Version 1.8.1 (git-94ef202)

Generally, as I understand it, the workaround flow is:
kops create cluster $NAME --target=terraform --out=.
terraform apply
kops rolling-update cluster $NAME --cloudonly --force --yes (around 20 mins with 3 masters and 3 nodes), and then it should work, but I had to re-export the kops config:
kops export kubecfg $NAME
and now it works for both kops and kubectl.
Are there any ideas on how to resolve this? I was also wondering whether, in general, the gossip-based approach is inferior to the DNS approach.

@Mosho1


commented May 17, 2018

The fix using rolling-update did not work for me.

Version 1.9.0 (git-cccd71e67)

@mbolek


commented May 17, 2018

@Mosho1 did you export the config?
Can you check if the server in the ~/.kube/config points to an external endpoint?

@Mosho1


commented May 17, 2018

@mbolek yeah, it did, though I have already brought down that cluster and used kops directly instead.

@Hermain


commented Jun 1, 2018

Fyi: Still broken in 1.9.0

@1ambda


commented Jun 11, 2018

In 1.9.1 too. I am running a gossip-based cluster (.local) and was able to work around this issue by following the comments above.

# Assumes you have already applied terraform once and the ELB for the kube API exists on AWS

# Make sure to export kubecfg before applying terraform, so that the LC is configured with the exported config.
kops export kubecfg --name $NAME
kops update cluster $NAME --target=terraform --out=.
terraform plan
terraform apply 

kops rolling-update cluster $NAME --cloudonly --force --yes

If it keeps failing, you might add insecure-skip-tls-verify: true to the cluster entry in ~/.kube/config, but that is usually not recommended.
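For reference, a hedged sketch of setting that flag from the CLI instead of editing the file by hand (the cluster name is the placeholder from the original report; kubectl may refuse the insecure flag while certificate-authority-data is still set in the same entry, in which case that field needs to be removed as well):

kubectl config set-cluster kops-temp.k8s.local --insecure-skip-tls-verify=true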

@gtmtech


commented Jul 10, 2018

Who wants to do a rolling-update straight after provisioning a cluster? kops should provision the correct server entries in the kubectl config file in the first place. Given that kops creates a DNS entry just fine with a sensible name, e.g. api.cluster.mydomain.net (as an alias record to the ELB/ALB), why isn't kops export kubecfg using the alias record in the server field instead of the ELB? This alias record is already in the certificate, as the OP says, and if kops generates a kubectl config entry using server: https://[alias record], then it works just fine, and no rolling updates or post-shenanigans are needed.

This should work out of the box
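For reference, a hedged sketch of pointing an existing kubeconfig entry at such an alias record by hand (DNS-based clusters only; the names are the example ones from this comment, not real endpoints):

kubectl config set-cluster cluster.mydomain.net --server=https://api.cluster.mydomain.net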

@mbolek


commented Jul 25, 2018

# kops version
Version 1.9.2 (git-cb54c6a52)

OK... so I thought I had something, but it seems the issue persists. You need to export the config to fix the API server endpoint, and you need to roll the master to fix the SSL cert.

@drzero42


commented Aug 30, 2018

Another workaround that does not require waiting to roll the master(s) is to create the ELB first, then update the cluster, and then do the rest of the terraform apply. Steps are as follows (a consolidated sketch comes after the list):

  • Create cluster as usual
  • Create internet gateway, or ELB will fail to deploy: terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
  • Create ELB: terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
  • Update cluster (which will catch the DNS name for the ELB and issue a new master cert, as well as export a new kubecfg): kops update cluster --out=. --target=terraform
  • Create everything else: terraform apply
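A consolidated sketch of the steps above; CLUSTERNAME, the state bucket, and the zones are placeholders, and the terraform resource names follow the pattern shown in the steps:

kops create cluster --name=CLUSTERNAME.k8s.local --state=s3://<state-bucket> --zones=<zones> --target=terraform --out=.
terraform init
terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
kops update cluster CLUSTERNAME.k8s.local --state=s3://<state-bucket> --target=terraform --out=.
terraform apply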
@mshivanna


commented Sep 6, 2018

@mbolek the issue indeed persists
kops version
Version 1.10.0

@CosmoPennypacker


commented Nov 7, 2018

@drzero42 - Thanks for the tip! It works, but you forgot to prefix -target on the 2nd apply step,
i.e.:

Create ELB: terraform apply aws_elb.api-CLUSTERNAME-k8s-local

should be:
Create ELB: terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local

@drzero42


commented Nov 7, 2018

@CosmoPennypacker Absolutely right, good catch. I've updated my comment so people can more easily copy/paste from it ;)

@elliottgorrell


commented Nov 8, 2018

For our creation script we have implemented this workaround; however, as we don't care about downtime, we do the following to speed up creation from ~20 minutes to ~5 minutes:

kops rolling-update cluster --cloudonly --force --master-interval=1s --node-interval=1s --yes

# Wait until all nodes come back online before marking complete
until kops validate cluster --name ${CLUSTER_NAME} > /dev/null
do
  echo "\033[1;93mWaiting until cluster comes back online\033[0m"
  sleep 5
done

echo "\033[1;92mCluster Creation Complete!\033[0m"
@teagy-cr


commented Nov 11, 2018

Technically, if you template the steps that generate the API certificate, you could feed the ELB DNS name from the terraform output to the script before it generates the certificate initially and stores it in the state store.
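For reference, a hypothetical sketch of grabbing the API ELB's DNS name once it exists, so it could be fed to such a script (the name filter assumes the api-<cluster> naming pattern used for the load balancer in this issue):

aws elb describe-load-balancers \
    --query "LoadBalancerDescriptions[?starts_with(LoadBalancerName, 'api-kops-temp-k8s-local')].DNSName" \
    --output text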

@tkatrichenko


commented Nov 19, 2018

@teagy-cr and do you know how to do that?

@jfreymann


commented Jan 14, 2019

This is still a valid issue, using the workaround outlined above.

@MBalazs90


commented Feb 24, 2019

This issue still persists...

@pachecobruno


commented Mar 14, 2019

This issue still persists... kops version = 1.11.1

kops validate cluster

Using cluster from kubectl context: milkyway.k8s.local
Validating cluster milkyway.k8s.local

unexpected error during validation: error listing nodes: Get https://api.milkyway.k8s.local/api/v1/nodes: dial tcp: lookup api.milkyway.k8s.local on 192.168.88.1:53: no such host

The configuration generated by kops and terraform continues to treat the API endpoint as the .k8s.local DNS name rather than the ELB.

lieut-data added a commit to mattermost/mattermost-cloud that referenced this issue Apr 18, 2019
Workaround kops+gossip+terraform issue.
Apply the fix from kubernetes/kops#2990 (comment) to ensure the gossip-based cluster is reachable when using terraform output.

gabrieljackson added a commit to mattermost/mattermost-cloud that referenced this issue Apr 18, 2019
Workaround kops+gossip+terraform issue. (#7)
Apply the fix from kubernetes/kops#2990 (comment) to ensure the gossip-based cluster is reachable when using terraform output.
@mbolek


commented May 28, 2019

@lieut-data, @gabrieljackson

# kops validate cluster dev2.k8s.local
Validating cluster dev2.k8s.local

unexpected error during validation: error listing nodes: Get https://api.dev2.k8s.local/api/v1/nodes: dial tcp: lookup api.dev2.k8s.local on 192.168.1.1:53: no such host
# kops version
Version 1.12.1 (git-e1c317f9c)

:(
and then

kops update cluster dev2.k8s.local --target=terraform --out=.
I0528 09:50:11.688046    5454 apply_cluster.go:559] Gossip DNS: skipping DNS validation
I0528 09:50:13.463433    5454 executor.go:103] Tasks: 0 done / 95 total; 46 can run
I0528 09:50:14.477643    5454 executor.go:103] Tasks: 46 done / 95 total; 27 can run
I0528 09:50:15.245765    5454 executor.go:103] Tasks: 73 done / 95 total; 18 can run
I0528 09:50:16.011278    5454 executor.go:103] Tasks: 91 done / 95 total; 3 can run
I0528 09:50:17.764038    5454 vfs_castore.go:729] Issuing new certificate: "master"
I0528 09:50:19.326700    5454 executor.go:103] Tasks: 94 done / 95 total; 1 can run
I0528 09:50:19.327570    5454 executor.go:103] Tasks: 95 done / 95 total; 0 can run
I0528 09:50:19.348221    5454 target.go:312] Terraform output is in .
I0528 09:50:19.573892    5454 update_cluster.go:291] Exporting kubecfg for cluster
kops has set your kubectl context to dev2.k8s.local

Terraform output has been placed into .

Changes may require instances to restart: kops rolling-update cluster

So it still has to recreate the master cert

@fejta-bot


commented Aug 26, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
