kubeadm: apply retries to all API calls in idempotency.go #123271

Merged
merged 1 commit into from Feb 19, 2024

Conversation

neolit123
Member

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

The idempotency.go (perhaps not so accurately named) contains API calls that kubeadm does against an API server using client-go.

Some users seem to have unstable setups where for unknown reasons the API server can be unavailable or refuse to respond as expected.

Use PollUntilContextTimeout in all exported functions to ensure such API calls are all retry-able.

NOTE: The context passed to PollUntilContextTimeout is not propagated to the polled function. Instead, the polled function creates its own context with 'ctx := context.Background()'. This avoids breaking the expectations of callers that expect a certain type of error and not "context timeout" errors.

Additional changes:

  • Make all context.TODO() -> context.Background()
  • Update all unit tests and make sure during testing the retry interval and timeout are short
  • Remove the TestMutateConfigMapWithConflict test. It does not contribute much, because conflict handling is done on the API server side, not on the side of kubeadm. Simulating this is not needed.

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#1606

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

kubeadm: make sure that a variety of API server requests are retried during "init", "join", "upgrade", "reset" workflows. Prior to this change some API server requests, such as creating or updating ConfigMaps, were "one-shot", i.e. they could fail if the API server dropped connectivity for a very short period of time.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 13, 2024
@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 13, 2024
@neolit123
Member Author

sending this initial PR as WIP to check if the kind jobs are happy.
a variety of tests need to be updated here.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 13, 2024
@pacoxu
Member

pacoxu commented Feb 15, 2024

/cc

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 15, 2024
@neolit123 neolit123 force-pushed the 1.30-retry-all-api-calls branch 2 times, most recently from 3fd4778 to a2957ae Compare February 15, 2024 19:57
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 15, 2024
@neolit123
Member Author

this is ready for review. there is a chance that something from this PR will break our e2e test assumption, but generally "retry all the things" is what we want. it has been discussed a few times in the past...it's just that nobody took action to send the code change.

/uncc @chendave
/cc @my-git9
(i wrote most of the unit tests from scratch)
/cc @SataQiu

@k8s-ci-robot k8s-ci-robot requested review from my-git9 and removed request for chendave February 16, 2024 07:01
Member Author

@neolit123 neolit123 left a comment

might be easier to review by looking at the files instead of diff

Comment on lines +118 to +125
// Override the default timeouts to be shorter
defaultTimeouts := kubeadmapi.GetActiveTimeouts()
defaultAPICallTimeout := defaultTimeouts.KubernetesAPICall
defaultTimeouts.KubernetesAPICall = &metav1.Duration{Duration: time.Microsecond * 500}
defer func() {
defaultTimeouts.KubernetesAPICall = defaultAPICallTimeout
}()

Member Author

doing these overrides for tests seems OK,
i ran time go test ... and saw no major increase in time for our unit tests.

@@ -81,131 +95,189 @@ func CreateOrMutateConfigMap(client clientset.Interface, cm *v1.ConfigMap, mutat
return lastError
}

-// MutateConfigMap takes a ConfigMap Object Meta (namespace and name), retrieves the resource from the server and tries to mutate it
+// mutateConfigMap takes a ConfigMap Object Meta (namespace and name), retrieves the resource from the server and tries to mutate it
Member Author

making this private as it was not used anywhere

Comment on lines -114 to -116
if !apierrors.IsNotFound(err) {
return nil
}
Member Author

all of these CreateOrRetain* functions had the wrong logic IMO, but please check if i am correct.

the logic used to be:

  1. if the error is not a "not found" error return success.
  2. if the error is a "not found" error create the object

i don't think 1. is correct, so i changed it:

  1. if the error is not a "not found" error retry, it could be a connectivity error
  2. same as above

these were added when the coredns migration logic was added, so hopefully they don't break anything.
upgrades work for me locally, but let's see.

Member

+1
CreateOrRetainDeployment is an example.

err := wait.PollUntilContextTimeout(context.Background(),
apiCallRetryInterval, kubeadmapi.GetActiveTimeouts().KubernetesAPICall.Duration,
true, func(_ context.Context) (bool, error) {
ctx := context.Background()
Member Author

this is explained in the release note too.
we don't want to use the context passed from here:

 func(_ context.Context) (bool, error) {

because it changes expectations of the caller site if they are expecting e.g. an API error, but instead would get a context timeout error as the "last error"

os.Exit(exitVal)
}

func TestCreateOrUpdateConfigMap(t *testing.T) {
Member Author

new tests added for everything.
coverage for this file was around 97%

@neolit123
Member Author

/hold
/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2024
@neolit123 neolit123 changed the title WIP: kubeadm: apply retries to all API calls in idempotency.go kubeadm: apply retries to all API calls in idempotency.go Feb 16, 2024
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 16, 2024
Member

@pacoxu pacoxu left a comment

/lgtm

@@ -139,7 +939,7 @@ func TestPatchNode(t *testing.T) {
}
}, &lastError)
success, err := conditionFunction(context.Background())
if err != nil {
if err != nil && success {
Member

tc.success is more readable to me.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c931b8898ccebdad5ec93c091e89a256bf3d69f1

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neolit123, pacoxu

The idempotency.go (perhaps not so accurately named) contains
API calls that kubeadm does against an API server using client-go.

Some users seem to have unstable setups where for unknown reasons
the API server can be unavailable or refuse to respond as expected.

Use PollUntilContextTimeout in all exported functions to ensure
such API calls are all retry-able.

NOTE: The context passed to PollUntilContextTimeout is not propagated
to the polled function. Instead, the polled function creates its own
context with 'ctx := context.Background()'. This avoids breaking
the expectations of callers that expect a certain type of error
and not "context timeout" errors.

Additional changes:
- Make all context.TODO() -> context.Background()
- Update all unit tests and make sure during testing the retry
interval and timeout are short. Test coverage of idempotency.go
is at ~97%.
- Remove the TestMutateConfigMapWithConflict test. It does not
contribute much, because conflict handling is done on the API
server side, not on the side of kubeadm. Simulating this is not
needed.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2024
@pacoxu
Member

pacoxu commented Feb 18, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c331b4d286def46a53c2ce1d22aab6dda8eecf1f

Member

@SataQiu SataQiu left a comment

LGTM

Member

@my-git9 my-git9 left a comment

/lgtm

@my-git9 my-git9 removed their assignment Feb 19, 2024
@pacoxu
Member

pacoxu commented Feb 19, 2024

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2024
@neolit123
Member Author

let's run this in e2e CI and watch for problems so that i can fix them before CF.

@k8s-ci-robot k8s-ci-robot merged commit 7225dc6 into kubernetes:master Feb 19, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 19, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

evaluate the retry logic for API calls
5 participants