Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow ebs volume TF resource names to begin with digit #10424

Merged
merged 1 commit into from
Jan 5, 2021

Conversation

rifelpet
Copy link
Member

Fixes #9982

This will fix the issue for both new and existing clusters and only the clusters that are directly impacted by the bug. We could theoretically merge this as well as #10361.

My main concern with this approach is if users don't terraform state mv on existing clusters and Terraform deletes and recreates the EBS volumes; their entire etcd state will be lost. We can try to make the release note more prominent but I'm open to other ideas on how to try to prevent people from making that mistake.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 15, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/documentation area/provider/aws Issues or PRs related to aws provider approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 15, 2020
@hakman
Copy link
Member

hakman commented Dec 15, 2020

I don't have a strong preference about this issue or which fix to merge.
To me seems safer to merge #10361 and update the release notes in this PR to explain how to rename the etcd members manually. Easier to fix after bumping into the issue than after deleting and recreating the EBS volumes.

@rifelpet
Copy link
Member Author

Have we come up with instructions for how to rename an existing cluster's members without disruption?

@hakman
Copy link
Member

hakman commented Dec 15, 2020

Not that I know of.

@rifelpet
Copy link
Member Author

I believe that renaming the members in the ClusterSpec would result in the terraform resource names changing, so users would need to terraform state mv regardless in order to avoid data loss.

@hakman
Copy link
Member

hakman commented Dec 15, 2020

I agree, but at least they would read the release notes first.
With automatic renaming, if terraform state mv is not run, the only thing that can be done is to restore the last backup (if I understand correctly).

@rifelpet rifelpet force-pushed the ebs-tf-012 branch 2 times, most recently from 0c17d05 to 94ea0fd Compare December 15, 2020 05:11
@rifelpet
Copy link
Member Author

rifelpet commented Dec 15, 2020

We know that simply renaming the etcd members in the ClusterSpec is disruptive to the etcd cluster. Therefore I don't think we can recommend that as a solution in the release notes (along with a terraform state mv unless we come up with a nondisruptive migration path.

I just added text to the top of the release notes to make it more apparent.

I likely won't have the opportunity to experiment with etcd member renaming before we want 1.19.0 released so if anyone else wants to give it a shot and try to come up with a graceful renaming strategy, it would be much appreciated.

@rifelpet
Copy link
Member Author

I think setting terraform's prevent_destroy on the ebs volumes might help here.

https://www.terraform.io/docs/configuration/meta-arguments/lifecycle.html#prevent_destroy

We'll have to see if it works with only the new version of the resource having it set rather than the old version of the resource having it set.

We'll need to also consider any workflows that might break by having this set.

@justinsb
Copy link
Member

I did check whether terraform 0.12upgrade established a convention here, it does not (or did not for me). It instead added this comment:

< # TF-UPGRADE-TODO: In Terraform v0.11 and earlier, it was possible to begin a
< # resource name with a number, but it is no longer possible in Terraform v0.12.
< #
< # Rename the resource and run `terraform state mv` to apply the rename in the
< # state. Detailed information on the `state move` command can be found in the
< # documentation online: https://www.terraform.io/docs/commands/state/mv.html

@justinsb
Copy link
Member

So I was thinking about our options here @rifelpet and @hakman. So I think we could do both PRs...

  • We merge Prefix etcd cluster names with letters #10361 so that new clusters don't have the problem
  • We merge this PR so that we can support tf 0.12. But I propose we add a flag / env-var which users must set, and without that we don't do the automatic rename. We can also change the logic to give a specific error here encouraging setting this env var.

What we'll end up with is no changes for unaffected users, and users that would be affected are unaffected for new clusters. Existing affected clusters will experience a message saying "you must rename using tf state mv and then set the env var KOPS_TERRAFORM_0_12_RENAMED=,". They must continue to pass this env var indefinitely, which I think is probably acceptable ... it's comparable to having to set KOPS_STATE_STORE. If it was more widespread or if this becomes annoying, we can also look at persisting something in the cluster (to indicate that the rename has been done), but I think (1) we can do that in future and (2) we don't know today exactly what that would like look like.

If we agree on this course, I can merge both PRs and then tweak this one to only do the prefixing when an env var is set.

Does that sound good @rifelpet and @hakman ?

@rifelpet
Copy link
Member Author

I think the environment variable approach sounds good, yes. I left a comment in the related GH issue asking for feedback so hopefully we hear from others.

I do worry about whether we have other resources that are susceptible to this problem and whether we'd want to "scope" the variable to just the ebs volumes, perhaps require KOPS_TERRAFORM_0_12_RENAMED=aws_ebs_volume and we can append to that if necessary?

I also wonder if we can prevent this for any new resource types down the road by adding conditional prefixing to the terraform target code.

This is a bit tangential, but I know we've discussed requiring single minor version upgrades for kops in the past and this is one situation that would be easier if we had that requirement. We could add the env var check in kops 1.19 and remove it in 1.20 since we'd know that any clusters being upgraded to kops 1.20 will have been updated by kops 1.19 previously.

@hakman
Copy link
Member

hakman commented Dec 30, 2020

First of all I am ok with merging my PR 😀.

Adding the KOPS_TERRAFORM_0_12_RENAMED env var seems a good idea.
I wouldn't make it to granular, but would just do it for all found resources when needed.
I would keep the flag either forever of based on supported TF version. I don't think that going through each kOps version is that appealing.

@olemarkus
Copy link
Member

It's a different cause, but the security group rule renaming has similar challenge (#10179 and #10188). Here the consequence is less severe though. I believe only a few packets may be delayed.

@justinsb
Copy link
Member

justinsb commented Dec 30, 2020

I think the environment variable approach sounds good, yes. I left a comment in the related GH issue asking for feedback so hopefully we hear from others.

I saw that - thanks!

I do worry about whether we have other resources that are susceptible to this problem and whether we'd want to "scope" the variable to just the ebs volumes, perhaps require KOPS_TERRAFORM_0_12_RENAMED=aws_ebs_volume and we can append to that if necessary?

I also wonder if we can prevent this for any new resource types down the road by adding conditional prefixing to the terraform target code.

I think that's a good idea / point: in theory this could be happening to other resources and we just don't know about it because it's so rare. That said, I think EBS volumes are the one for which this really matters (because stateful) so I think we can probably just scope the env var to EBS volumes (at least for now), and we can likely add the prefix to all other tf resources automatically.

I can play with this in more detail on the "third PR" after we merge this one and #10361 (or if you want to experiment, feel free :-) )

This is a bit tangential, but I know we've discussed requiring single minor version upgrades for kops in the past and this is one situation that would be easier if we had that requirement. We could add the env var check in kops 1.19 and remove it in 1.20 since we'd know that any clusters being upgraded to kops 1.20 will have been updated by kops 1.19 previously.

Good point. I have to admit that I'm less bothered by a few extra lines of code here and there than many others are :-) If we do something like this, it would be nice to guide the user to avoid mistakes... we do have the logic around allow-kops-downgrade but AFAICT that doesn't run in the terraform case. (Edit: fixed the name of the flag)

@rifelpet
Copy link
Member Author

I'm fine with not requiring the env var be scoped to EBS volumes somehow. As mentioned, they're the most critical terraform resource because they're stateful. I agree we should merge both PRs and open a third PR with the env var requirement.

@hakman
Copy link
Member

hakman commented Jan 5, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0ca0e38 into kubernetes:master Jan 5, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Jan 5, 2021
k8s-ci-robot added a commit that referenced this pull request Jan 5, 2021
…-upstream-release-1.18

Automated cherry pick of #10424: Don't allow ebs volume TF resource names to begin with digit
k8s-ci-robot added a commit that referenced this pull request Jan 5, 2021
…-upstream-release-1.19

Automated cherry pick of #10424: Don't allow ebs volume TF resource names to begin with digit
@rifelpet rifelpet deleted the ebs-tf-012 branch May 5, 2021 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation area/provider/aws Issues or PRs related to aws provider blocks-next cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
5 participants