
aws-auth configmap does not get re-created after a cluster replacement, preventing nodes from joining the cluster #107

Closed
metral opened this issue Apr 23, 2019 · 8 comments · Fixed by pulumi/pulumi-kubernetes#577
Labels
kind/bug (Some behavior is incorrect or out of spec), p1 (A bug severe enough to be the next item assigned to an engineer)

Comments

@metral
Contributor

metral commented Apr 23, 2019

When an EKS cluster is replaced, the replacement itself succeeds, but the aws-auth configmap used for user/role mappings does not get recreated. This in turn prevents the new worker nodes from joining the cluster.

The aws-auth configmap gets created here. Because none of the IAM resources it depends on are replaced or updated during the cluster replacement, the engine concludes that aws-auth does not need to be replaced or updated either. However, when the old cluster is torn down, the configmap goes away with it, and the Pulumi Kubernetes provider does not notice that it needs to recreate aws-auth for the new cluster.

Per discussions offline w/ Luke, the thought was that the kube provider kx-eks-cluster-eks-k8s should have been replaced instead of updated. Since the provider is the only dependency of aws-auth, had the provider been replaced, aws-auth would have been recreated as well.

Changes:
 
    Type                           Name                                   Operation
-+  aws:eks:Cluster                kx-eks-cluster-eksCluster              replaced
~   pulumi:providers:kubernetes    kx-eks-cluster-eks-k8s                 updated
~   aws:ec2:SecurityGroup          kx-eks-cluster-nodeSecurityGroup       updated
~   pulumi-nodejs:dynamic:Resource kx-eks-cluster-vpc-cni                 updated
-+  aws:ec2:LaunchConfiguration    kx-eks-cluster-nodeLaunchConfiguration replaced
~   aws:cloudformation:Stack       kx-eks-cluster-nodes                   updated
~   pulumi:providers:kubernetes    kx-eks-cluster-provider                updated
 
Resources:
    +-2 replaced
    ~ 5 updated
    18 unchanged
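
For context, the dependency shape is roughly the following. This is a minimal sketch written against today's pulumi-kubernetes Go SDK purely for illustration (the actual pulumi-eks component is TypeScript, and the kubeconfig and role ARN below are placeholders). The configmap's only dependency is the explicit provider, and none of its own inputs change when the cluster is replaced, so the engine sees no diff for it.

package main

import (
	"github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes"
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Provider built from the cluster's kubeconfig. When the cluster is
		// replaced, only this provider's kubeconfig input changes.
		k8sProvider, err := kubernetes.NewProvider(ctx, "eks-k8s", &kubernetes.ProviderArgs{
			Kubeconfig: pulumi.String("<kubeconfig for the EKS cluster>"), // placeholder
		})
		if err != nil {
			return err
		}

		// aws-auth maps the node IAM role into the cluster. Its inputs only
		// reference IAM resources, which are untouched by the replacement, so
		// the ConfigMap itself shows no changes of its own.
		_, err = corev1.NewConfigMap(ctx, "aws-auth", &corev1.ConfigMapArgs{
			Metadata: &metav1.ObjectMetaArgs{
				Name:      pulumi.String("aws-auth"),
				Namespace: pulumi.String("kube-system"),
			},
			Data: pulumi.StringMap{
				"mapRoles": pulumi.String("- rolearn: arn:aws:iam::123456789012:role/node-instance-role\n  username: system:node:{{EC2PrivateDNSName}}\n  groups:\n    - system:bootstrappers\n    - system:nodes\n"),
			},
		}, pulumi.Provider(k8sProvider))
		return err
	})
}

Unless the provider resource itself is replaced, the configmap is never recreated against the new cluster, which is exactly the failure mode described above.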

To repro this, we'll use the same code from #69 (comment).

Steps:

  1. Download pulumi-full-aws-eks.zip
  2. Run pulumi up in the unzipped dir
  3. After the initial deployment is complete, comment out the subnetIds.pop() line (line 74), and run another update.
  4. After about 20 minutes, the EKS cluster replacement onto 3 subnets completes
  5. kubectl cluster-info returns success
  6. kubectl get pods --all-namespaces returns the coredns Pods stuck in Pending, as there aren't any workers to schedule onto.
  7. kubectl get cm aws-auth -n kube-system -o yaml returns nothing
  8. kubectl get nodes -o wide --show-labels returns nothing

/cc @lukehoban @hausdorff

@metral metral changed the title aws-auth configmap does not get re-created on after a cluster replacement, preventing nodes from joining the cluster aws-auth configmap does not get re-created after a cluster replacement, preventing nodes from joining the cluster Apr 23, 2019
@lukehoban
Member

Per discussions offline w/ Luke, the thought was that the kube provider kx-eks-cluster-eks-k8s should have been replaced instead of updated.

Yes. The root issue here is that the Kubernetes provider does an update instead of a replace when the kubeconfig input changes, and as a result, none of the resources deployed using that provider realize they need to change.

It looks like this was intentional, and it was acknowledged that this is not a conservative decision (and hence can result in the repro in this issue):

https://github.com/pulumi/pulumi/blob/06d4268137b1ab91037d946dee2c1269d6f48c56/pkg/resource/plugin/provider_plugin.go#L107-L119

// DiffConfig checks what impacts a hypothetical change to this provider's configuration will have on the provider.
func (p *provider) DiffConfig(olds, news resource.PropertyMap) (DiffResult, error) {
	// There are two interesting scenarios with the present gRPC interface:
	// 1. Configuration differences in which all properties are known
	// 2. Configuration differences in which some new property is unknown.
	//
	// In both cases, we return a diff result that indicates that the provider _should not_ be replaced. Although this
	// decision is not conservative--indeed, the conservative decision would be to always require replacement of a
	// provider if any input has changed--we believe that it results in the best possible user experience for providers
	// that do not implement DiffConfig functionality. If we took the conservative route here, any change to a
	// provider's configuration (no matter how inconsequential) would cause all of its resources to be replaced. This
	// is clearly a bad experience, and differs from how things worked prior to first-class providers.
	return DiffResult{Changes: DiffUnknown, ReplaceKeys: nil}, nil
}

One approach to solving this would, I believe, be to correctly implement DiffConfig (and CheckConfig) for the Kubernetes provider, to indicate replacements needed on changes to kubeconfig (or perhaps only if we can identify that the target cluster is actually different?).

https://github.com/pulumi/pulumi-kubernetes/blob/d0f00cf432c7a63847afe5ae7b1a598d0523a7f2/pkg/provider/provider.go#L111-L114

// DiffConfig diffs the configuration for this provider.
func (k *kubeProvider) DiffConfig(ctx context.Context, req *pulumirpc.DiffRequest) (*pulumirpc.DiffResponse, error) {
	return nil, status.Error(codes.Unimplemented, "DiffConfig is not yet implemented")
}

The ability to define a custom DiffConfig was just added recently in pulumi/pulumi#2512, but it looks like we haven't yet implemented that in the Kubernetes provider.
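
For illustration, here is a rough sketch of the kind of DiffConfig described above. It is hypothetical and not necessarily what the eventual fix ends up doing; it assumes the provider's existing pulumirpc and plugin imports, and it keys the replacement off a plain comparison of the kubeconfig property rather than real detection of whether the target cluster changed.

// Hypothetical sketch: force replacement of the provider (and hence the
// resources created through it) when the configured kubeconfig changes.
func (k *kubeProvider) DiffConfig(ctx context.Context, req *pulumirpc.DiffRequest) (*pulumirpc.DiffResponse, error) {
	olds, err := plugin.UnmarshalProperties(req.GetOlds(), plugin.MarshalOptions{KeepUnknowns: true})
	if err != nil {
		return nil, err
	}
	news, err := plugin.UnmarshalProperties(req.GetNews(), plugin.MarshalOptions{KeepUnknowns: true})
	if err != nil {
		return nil, err
	}

	// If the kubeconfig differs, the provider may now point at a different
	// cluster, so report a diff that requires replacement.
	if !olds["kubeconfig"].DeepEquals(news["kubeconfig"]) {
		return &pulumirpc.DiffResponse{
			Changes:  pulumirpc.DiffResponse_DIFF_SOME,
			Replaces: []string{"kubeconfig"},
		}, nil
	}
	return &pulumirpc.DiffResponse{Changes: pulumirpc.DiffResponse_DIFF_NONE}, nil
}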

@lukehoban lukehoban self-assigned this Apr 23, 2019
@lukehoban lukehoban added this to the 0.23 milestone Apr 23, 2019
@metral metral added the bug label Apr 30, 2019
@lukehoban
Member

@lblackstone I believe this is an issue in the core Pulumi Kubernetes provider, which we'll likely need to fix by implementing DiffConfig as noted above.

@lblackstone
Member

The required plumbing for this work isn't done yet:

matt [10:14 AM]
I believe that Pat added these the last time he was touching the RPC interface, as he felt he was going to need the RPCs to support something (either related to first class providers or for dealing with changes to default provider configuration) but he never actually implemented them either in the bridge or updated Pulumi to use them.

luke [10:18 AM]
That’s my understanding too. This is part of https://github.com/pulumi/pulumi/issues/1718.  

My understanding is that when 1st class providers were first implemented, we didn’t want a forced breaking change to the RPC interface, so we provided default behavior for diffing provider configuration (which we knew was not in general correct - there is a code comment to this effect). The desire was to expose optional config check/diff to allow providers to participate in this decision. It looks like we did part of that work right before Pat left, but not the plumbing through to the provider config codepath.

See pulumi/pulumi#1718

@lblackstone
Member

luke [10:26 AM]
The engine code linked in that issue may be the only place that needs to be updated. I believe the DiffConfig and CheckConfig there need to be updated to do something similar to what the Diff and Check there currently do, but calling the underlying provider's DiffConfig/CheckConfig respectively.

https://github.com/pulumi/pulumi/blob/06d4268137b1ab91037d946dee2c1269d6f48c56/pkg/resource/plugin/provider_plugin.go#L87-L119
https://github.com/pulumi/pulumi/blob/06d4268137b1ab91037d946dee2c1269d6f48c56/pkg/resource/plugin/provider_plugin.go#L183-L318
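
In other words, the engine-side change would look something like the sketch below, mirroring how Diff forwards to the plugin. This is only a sketch: the p.client field name, the marshaling options, and the Unimplemented fallback are assumptions based on how Diff and Check are written in the same file.

// Hypothetical sketch: forward DiffConfig to the provider plugin over gRPC
// instead of unconditionally reporting "no replacement needed".
func (p *provider) DiffConfig(olds, news resource.PropertyMap) (DiffResult, error) {
	molds, err := MarshalProperties(olds, MarshalOptions{KeepUnknowns: true})
	if err != nil {
		return DiffResult{}, err
	}
	mnews, err := MarshalProperties(news, MarshalOptions{KeepUnknowns: true})
	if err != nil {
		return DiffResult{}, err
	}

	resp, err := p.client.DiffConfig(p.ctx.Request(), &pulumirpc.DiffRequest{
		Olds: molds,
		News: mnews,
	})
	if err != nil {
		// Providers that haven't implemented DiffConfig return Unimplemented;
		// fall back to the old permissive behavior in that case.
		if status.Code(err) == codes.Unimplemented {
			return DiffResult{Changes: DiffUnknown}, nil
		}
		return DiffResult{}, err
	}

	replaces := make([]resource.PropertyKey, 0, len(resp.GetReplaces()))
	for _, k := range resp.GetReplaces() {
		replaces = append(replaces, resource.PropertyKey(k))
	}
	changes := DiffUnknown
	switch resp.GetChanges() {
	case pulumirpc.DiffResponse_DIFF_NONE:
		changes = DiffNone
	case pulumirpc.DiffResponse_DIFF_SOME:
		changes = DiffSome
	}
	return DiffResult{Changes: changes, ReplaceKeys: replaces}, nil
}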

@metral metral self-assigned this May 10, 2019
metral added a series of commits to metral/pulumi-eks that referenced this issue between May 10 and May 14, 2019, each noting:
Note: this test works, but it is currently disabled until pulumi#107 is resolved.
@lukehoban
Member

This is dependent on pulumi/pulumi#2766 and then on updating the implementations of DiffConfig/CheckConfig in the Kubernetes provider.
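
For completeness, the CheckConfig side could start out as a simple passthrough; a minimal sketch (hypothetical, and a real implementation would presumably validate the kubeconfig and apply defaults):

// Hypothetical minimal CheckConfig: accept the new provider configuration as-is.
func (k *kubeProvider) CheckConfig(ctx context.Context, req *pulumirpc.CheckRequest) (*pulumirpc.CheckResponse, error) {
	return &pulumirpc.CheckResponse{Inputs: req.GetNews()}, nil
}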

@lukehoban lukehoban assigned ellismg and unassigned metral May 28, 2019
@ellismg
Contributor

ellismg commented May 30, 2019

pulumi/pulumi#2766 is now merged, so you should be able to pick up a dev build and run with it, @lblackstone. Let me know if you need help.

@lblackstone
Member

This fix requires v0.17.15 of the Pulumi CLI and a Kubernetes provider release that includes the pulumi/pulumi-kubernetes#577 changes (currently available in the dev package; it will be included in the upcoming 0.24.0 release).

@metral
Contributor Author

metral commented Jun 7, 2019

Thanks to the fixes in @pulumi/pulumi v0.17.16 and @pulumi/kubernetes v0.24.0, this scenario is now being added to the set of tests.

@infin8x infin8x added p1 A bug severe enough to be the next item assigned to an engineer kind/bug Some behavior is incorrect or out of spec labels Jul 10, 2021