New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubeadm: apply retries to all API calls in idempotency.go #123271
kubeadm: apply retries to all API calls in idempotency.go #123271
Conversation
sending this initial PR as WIP to check if the kind jobs are happy. |
/cc |
763b684
to
d533d70
Compare
3fd4778
to
a2957ae
Compare
a2957ae
to
2941a64
Compare
this is ready for review. there is a chance that something from this PR will break our e2e test assumption, but generally "retry all the things" is what we want. it has been discussed a few times in the past...it's just that nobody took action to send the code change. /uncc @chendave |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be easier to review by looking at the files instead of diff
// Override the default timeouts to be shorter | ||
defaultTimeouts := kubeadmapi.GetActiveTimeouts() | ||
defaultAPICallTimeout := defaultTimeouts.KubernetesAPICall | ||
defaultTimeouts.KubernetesAPICall = &metav1.Duration{Duration: time.Microsecond * 500} | ||
defer func() { | ||
defaultTimeouts.KubernetesAPICall = defaultAPICallTimeout | ||
}() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doing these overrides for tests seems OK,
i ran time go test ...
and saw no major increase in time for our unit tests.
@@ -81,131 +95,189 @@ func CreateOrMutateConfigMap(client clientset.Interface, cm *v1.ConfigMap, mutat | |||
return lastError | |||
} | |||
|
|||
// MutateConfigMap takes a ConfigMap Object Meta (namespace and name), retrieves the resource from the server and tries to mutate it | |||
// mutateConfigMap takes a ConfigMap Object Meta (namespace and name), retrieves the resource from the server and tries to mutate it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making this private as it was not used anywhere
if !apierrors.IsNotFound(err) { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these CreateOrRetain* functions had the wrong logic IMO, but please check if i am correct.
the logic used to be:
- if the error is not a "not found " error return success.
- if the error is a "not found" error create the object
i don't think 1. is correct, so i changed it:
- if the error is not a "not found" error retry, it could be a connectivity error
- same as above
these were added when the coredns migration logic was added, so hopefully they don't break anything.
upgrades work for me locally, but let's see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
CreateOrRetainDeployment
is an example.
err := wait.PollUntilContextTimeout(context.Background(), | ||
apiCallRetryInterval, kubeadmapi.GetActiveTimeouts().KubernetesAPICall.Duration, | ||
true, func(_ context.Context) (bool, error) { | ||
ctx := context.Background() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is explained in the release note too.
we don't want to use the context passed from here:
func(_ context.Context) (bool, error) {
because it changes expectations of the caller site if they are expecting e.g. an API error, but instead would get a context timeout error as the "last error"
os.Exit(exitVal) | ||
} | ||
|
||
func TestCreateOrUpdateConfigMap(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new tests added for everything.
coverage for this file was around 97%
/hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
if !apierrors.IsNotFound(err) { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
CreateOrRetainDeployment
is an example.
@@ -139,7 +939,7 @@ func TestPatchNode(t *testing.T) { | |||
} | |||
}, &lastError) | |||
success, err := conditionFunction(context.Background()) | |||
if err != nil { | |||
if err != nil && success { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tc.success
is more readable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM label has been added. Git tree hash: c931b8898ccebdad5ec93c091e89a256bf3d69f1
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: neolit123, pacoxu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The idempotency.go (perhaps not so accurately named) contains API calls that kubeadm does against an API server using client-go. Some users seem to have unstable setups where for unknown reasons the API server can be unavailable or refuse to respond as expected. Use PollUntilContextTimeout in all exported functions to ensure such API calls are all retry-able. NOTE: The context passed to PollUntilContextTimeout is not propagated in the polled function. Instead the poll function creates it's own context 'ctx := context.Background()', this is to avoid breaking expectations on the side of the callers, that expect a certain type of error and not "context timeout" errors. Additional changes: - Make all context.TODO() -> context.Background() - Update all unit tests and make sure during testing the retry interval and timeout are short. Test coverage of idempotency.go is at ~97%. - Remove the TestMutateConfigMapWithConflict test. It does not contribute much, because conflict handling is done at the API, server side, not on the side of kubeadm. This simulating this is not needed.
2941a64
to
c29450e
Compare
/lgtm |
LGTM label has been added. Git tree hash: c331b4d286def46a53c2ce1d22aab6dda8eecf1f
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/unhold |
let's run this in e2e CI and watch for problems so that i can fix them before CF. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The idempotency.go (perhaps not so accurately named) contains API calls that kubeadm does against an API server using client-go.
Some users seem to have unstable setups where for unknown reasons the API server can be unavailable or refuse to respond as expected.
Use PollUntilContextTimeout in all exported functions to ensure such API calls are all retry-able.
NOTE: The context passed to PollUntilContextTimeout is not propagated in the polled function. Instead the poll function creates it's own context 'ctx := context.Background()', this is to avoid breaking expectations on the side of the callers, that expect a certain type of error and not "context timeout" errors.
Additional changes:
Which issue(s) this PR fixes:
Fixes kubernetes/kubeadm#1606
Special notes for your reviewer:
NONE
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: