Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure CoreDNS running when Corefile migration doesn't support current version #88811

Merged
merged 1 commit into from Mar 18, 2020

Conversation

rajansandeep
Copy link
Contributor

@rajansandeep rajansandeep commented Mar 4, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR aims to fix a bug in the current released version of kubeadm that had 2 consequences:

  1. After a successful corefile migration, the current deployment of CoreDNS was not pointed to the backup copy, such that if the Deployment rollout of the new version of CoreDNS failed, a rollback could fail (if the new config was not compatible with the old version).
  2. After a failed corefile migration, the deployment of CoreDNS was left in a failed rollout state. This was issue Coredns fails to start if Corefile migration fails #88725.
    Which issue(s) this PR fixes:

Fixes #88725

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 4, 2020
@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 4, 2020
// Since the current configuration present is not the default version, try and migrate it.
updatedCorefile, err := migration.Migrate(currentInstalledCoreDNSVersion, kubeadmconstants.CoreDNSVersion, corefile, false)
if err != nil {
return errors.Wrap(err, "unable to migrate CoreDNS ConfigMap")
}

// Point the CoreDNS deployment to the `Corefile-backup` data.
if err := patchCoreDNSDeployment(client, "Corefile-backup"); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step points CoreDNS Deployment to a Configmap value that doesn't exist until the next step (or exists from a prior kubeadm upgrade). IMO, we should create the backup first, then point the deployment to it.

Minor nit: the name of the function patchCoreDNSDeployment() seems too generic for what it does. I think setCorefile() or something like that would be better.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 4, 2020
@neolit123
Copy link
Member

/lgtm
/approve
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2020
@neolit123
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 4, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neolit123, rajansandeep

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2020
@rajansandeep rajansandeep changed the title Ensure CoreDNS running when migration fails Ensure CoreDNS running when Corefile migration fails Mar 4, 2020
if err := setCorefile(client, "Corefile-backup"); err != nil {
return err
}

fmt.Println("[addons] Migrating CoreDNS Corefile")
changes, err := migration.Deprecated(currentInstalledCoreDNSVersion, kubeadmconstants.CoreDNSVersion, corefile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should setCorefile() be after this too? Just in case this fails?

Copy link
Contributor

@rosti rosti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajansandeep !
I have a bit of a problem here. migrateCoreDNSCorefile is supposed to be only called whenever IsCoreDNSConfigMapMigrationRequired returns true. I do think that the latter function is really supposed to return true and migration to be performed whenever CoreDNS is surely not going to start with the old config (hence a migration is required).
I am not OK to do unnecessary migrations as these can change a Corefile beyond recognition and can cause problems like the one with the failed migration.
If a migration is failing and it's required (CoreDNS can't work without it), then a failure to start is genuine.
With the above in mind, it's probably better to investigate why IsCoreDNSConfigMapMigrationRequired is returning true more often than it shouldn't.
/hold

@chrisohaver
Copy link
Contributor

IsCoreDNSConfigMapMigrationRequired is perhaps poorly named. It's function is to determine whether or not we can deploy a default CoreDNS config over top of the existing CoreDNS config. IOW, if the current CoreDNS config is at default (not modified), then we don't need to migrate the default, we can just re-deploy the new default over top of it.

@rajansandeep rajansandeep changed the title Ensure CoreDNS running when Corefile migration fails Ensure CoreDNS running when Corefile migration doesn't support current version Mar 4, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2020
Copy link
Contributor

@chrisohaver chrisohaver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rosti
Copy link
Contributor

rosti commented Mar 6, 2020

My point here is that during upgrades we should keep the behavior of CoreDNS the same as in the previous Kubernetes/CoreDNS version. Migration should only result in a new config if there are some unavoidable changes - config format change, drop of unsupported plugin and its replacement with a supported one.
In short a failed migration is a genuine error as I expect that the old config won't work with the new CoreDNS. Doing unnecessary migrations during upgrade (including replacing one default config with another) has the ability to change the CoreDNS behavior in between versions. Thus a cluster might get broken, slower, etc.
@neolit123 @fabriziopandini @ereslibre WDYT?

@neolit123
Copy link
Member

it makes sense to me to only migrate when actually needed, yet it don't know how easy it is to distinguish that.

@chrisohaver
Copy link
Contributor

chrisohaver commented Mar 6, 2020

In short a failed migration is a genuine error as I expect that the old config won't work with the new CoreDNS.

@rosti, In this PR, if migration.Migrate() returns an error, we leave CoreDNS unchanged: at the prior version, running on the old config. You are kind of implying that is not the case, so just making sure this is understood. Prior to this PR, this was not the case due to unintended code bug, not intentional design.

Regarding only making changes when necessary: migration.Migrate(), already does very close to what you describe. As called, it makes the following changes to a Corefile:

  1. It removes plugins and options that are no longer recognized and if left in place would cause the new version of CoreDNS to fail to start. These are absolutely required.
  2. It removes options from plugins that are ignored by CoreDNS. This could be skipped in theory. But removing an option that is ignored will not change behavior.
  3. It removes comments and standardizes indentation. This behavior is due to the way the corefile is parsed, and is hard to change. However, changes to indentation and comments will not functionally change a Corefile.

2 and 3 are IMO minor points because they do not change functionality. However, we could avoid them in part by inspecting the list of deprecations first. e.g. We could make the call to migration.Deprecated() earlier, loop through the results and only proceed to attempt to migrate if we see any "removed" options/plugins. However I don't think this applies to the scope of this PR, which is a bug fix.

This PR aims to fix a bug in the current released version of kubeadm that had 2 consequences:

  1. After a successful corefile migration, the current deployment of CoreDNS was not pointed to the backup copy, such that if the Deployment rollout of the new version of CoreDNS failed, a rollback could fail (if the new config was not compatible with the old version).
  2. After a failed corefile migration, the deployment of CoreDNS was left in a failed rollout state. This was issue Coredns fails to start if Corefile migration fails #88725.

@rosti
Copy link
Contributor

rosti commented Mar 9, 2020

Thanks for the clarification @chrisohaver !

In short a failed migration is a genuine error as I expect that the old config won't work with the new CoreDNS.

@rosti, In this PR, if migration.Migrate() returns an error, we leave CoreDNS unchanged: at the prior version, running on the old config. You are kind of implying that is not the case, so just making sure this is understood. Prior to this PR, this was not the case due to unintended code bug, not intentional design.

We seem to be on slightly different pages here. I was shocked that a fix for a failed migration is to leave the old config intact and the new CoreDNS version would simply pick it up and use it. The question I was asking myself is why did we need to attempt to perform the migration in first place? However, I saw that you already answered that below.

Regarding only making changes when necessary: migration.Migrate(), already does very close to what you describe. As called, it makes the following changes to a Corefile:

  1. It removes plugins and options that are no longer recognized and if left in place would cause the new version of CoreDNS to fail to start. These are absolutely required.
  2. It removes options from plugins that are ignored by CoreDNS. This could be skipped in theory. But removing an option that is ignored will not change behavior.
  3. It removes comments and standardizes indentation. This behavior is due to the way the corefile is parsed, and is hard to change. However, changes to indentation and comments will not functionally change a Corefile.

The interesting case here is 1. It's obvious - CoreDNS won't start if we have an old plugin.
2 helps us maintain good Corefile hygiene and 3 might be considered a bit dangerous (upon upgrade we remove user's comments).

2 and 3 are IMO minor points because they do not change functionality. However, we could avoid them in part by inspecting the list of deprecations first. e.g. We could make the call to migration.Deprecated() earlier, loop through the results and only proceed to attempt to migrate if we see any "removed" options/plugins. However I don't think this applies to the scope of this PR, which is a bug fix.

That's ideal for us. We don't want to silently skip telling the user that a migration error happened when the migration was supposed to remove an old plugin and make the config working with a the new CoreDNS version.

@rajansandeep Now that we are on the same page, I think the only thing left to do here is to make the earlier migration.Deprecated() call with a loop (as @chrisohaver suggested).

Copy link
Contributor

@rosti rosti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajansandeep !
I think, that I am happy with it. Can you please, squash your commits, so I can apply the LGTM label?

add an additional check for coredns image sha

add a check to see if migration is required
Copy link
Contributor

@rosti rosti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajansandeep !
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2020
@rajansandeep
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2020
@rajansandeep
Copy link
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Coredns fails to start if Corefile migration fails
5 participants