DO NOT MERGE: Testing context cancellation and status prioritization together #1163

smarterclayton

Together, #1161 and #1162 should significantly reduce the perceived end-to-end latency of pod startup and shutdown.

If CRI returns a container that has been created but is not running,
it is not safe to assume it is terminal, as our connection to CRI
may have failed. Instead, created is treated as waiting, as in
"waiting for this container to start". Either syncPod or
syncTerminatingPod is responsible for handling this state.
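
The rule described above can be illustrated with a minimal Go sketch. The type names, constants, and the `phaseFor` helper are hypothetical and are not the Kubelet's actual types; the handling of states other than created, running, and exited is an assumption made only to keep the example total.

```go
package main

import "fmt"

// containerState is a hypothetical stand-in for the state reported by CRI.
type containerState string

const (
	stateCreated containerState = "created"
	stateRunning containerState = "running"
	stateExited  containerState = "exited"
)

// statusPhase is a hypothetical simplification of the container status buckets.
type statusPhase string

const (
	phaseWaiting    statusPhase = "Waiting"
	phaseRunning    statusPhase = "Running"
	phaseTerminated statusPhase = "Terminated"
)

// phaseFor maps a runtime-reported state to a status phase.
func phaseFor(s containerState) statusPhase {
	switch s {
	case stateRunning:
		return phaseRunning
	case stateExited:
		return phaseTerminated
	case stateCreated:
		// Created but not running is not terminal: the container may
		// still be started, so report it as waiting.
		return phaseWaiting
	default:
		// Assumption for this sketch: anything unrecognized is also
		// reported as waiting rather than terminal.
		return phaseWaiting
	}
}

func main() {
	for _, s := range []containerState{stateCreated, stateRunning, stateExited} {
		fmt.Printf("%s -> %s\n", s, phaseFor(s))
	}
}
```

The point of the sketch is only that the created case must not fall into the terminated bucket; which sync method then acts on the waiting container is outside its scope.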
In preparation for allowing `sync*Pod` methods to be cancelled when
the pod transitions to terminating, pass context to the appropriate
methods in the Kubelet that might need to be cancelled within a
deadline or due to user input. Does not change the behavior of those
functions.

Change interface methods and stored structs for easier review.

Propagate context from the core long-running methods (CRI, GC, streaming)
up towards the top level. Methods that take a context imply remote
invocations of CRI, so the context is propagated up until it
hits either a method already carrying a context (such as HTTP servers,
or `sync*Pod`, which will perform cancellation), a top-level wait
loop, or a boundary with a subsystem that does not clearly deserve
context propagation. Top-level loops get context.Background()
and the rest get context.TODO(). This commit contains all such
transitions; subsequent PRs only propagate context.

For the remote CRI service, remove the wrappers that injected a new
context and call the direct context equivalents for timeout.

Contains all propagation of context upwards when the parent method
either now passes context, or context was already present.

Update test methods to pass contexts where changed.

@openshift-ci-robot added the backports/unvalidated-commits label (indicates that not all commits come to merged upstream PRs) on Feb 1, 2022
@openshift-ci-robot

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci

openshift-ci bot commented Feb 1, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smarterclayton
To complete the pull request process, please assign soltysh after the PR has been reviewed.
You can assign the PR to them by writing /assign @soltysh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the vendor-update label (touching vendor dir or related files) on Feb 1, 2022

None of the refactors touch it as it is deleted upstream

Track how long it takes for pod updates to propagate from detection
to successful change on API server. Will guide future improvements
in pod start and shutdown latency.

Streamline the pod status manager to track the set of updated pods
instead of using a buffered channel. Reduce the time the pod status
lock is held by moving other expensive checks out of the loop, which
also opens the door for parallelizing the status queue later. Avoid
making some checks twice now that syncPod is only called from
syncBatch. Protect apiStatusVersions under the pod status lock as
well to prevent accidents.

Some pod status transitions directly impact end-to-end user latency
in the Kubelet, such as pods going ready, going unready, or becoming
Succeeded or Failed.

Prioritize the order that pods are updated in to minimize that
latency.
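
One way to picture the two ideas above (tracking a set of updated pods instead of a buffered channel, and sending latency-sensitive updates first) is the Go sketch below. It is not the Kubelet's status manager: `statusManager`, `markUpdated`, `drain`, and the two-level `urgent` flag are hypothetical, and the real prioritization may distinguish more cases.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type podUID string

// pendingUpdate records why a pod is waiting to be synced.
type pendingUpdate struct {
	urgent bool   // hypothetical flag: readiness change or terminal phase
	seq    uint64 // arrival order, used to keep ordering stable
}

// statusManager tracks the set of pods with unsent status updates.
type statusManager struct {
	mu      sync.Mutex
	pending map[podUID]pendingUpdate
	nextSeq uint64
}

func newStatusManager() *statusManager {
	return &statusManager{pending: make(map[podUID]pendingUpdate)}
}

// markUpdated adds a pod to the pending set. Repeated updates for the same
// pod collapse into one entry, so the set never fills up or blocks the way
// a buffered channel can.
func (m *statusManager) markUpdated(uid podUID, urgent bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if prev, ok := m.pending[uid]; ok {
		prev.urgent = prev.urgent || urgent
		m.pending[uid] = prev
		return
	}
	m.nextSeq++
	m.pending[uid] = pendingUpdate{urgent: urgent, seq: m.nextSeq}
}

// drain swaps out the pending set under the lock and sorts it outside the
// lock, so the lock is held only briefly. Urgent pods come first.
func (m *statusManager) drain() []podUID {
	m.mu.Lock()
	batch := m.pending
	m.pending = make(map[podUID]pendingUpdate)
	m.mu.Unlock()

	uids := make([]podUID, 0, len(batch))
	for uid := range batch {
		uids = append(uids, uid)
	}
	sort.Slice(uids, func(i, j int) bool {
		a, b := batch[uids[i]], batch[uids[j]]
		if a.urgent != b.urgent {
			return a.urgent
		}
		return a.seq < b.seq
	})
	return uids
}

func main() {
	m := newStatusManager()
	m.markUpdated("pod-a", false)
	m.markUpdated("pod-b", true) // e.g. pod became ready
	m.markUpdated("pod-c", false)
	m.markUpdated("pod-a", false) // duplicate, collapses into one entry
	fmt.Println(m.drain())
}
```

Because sorting happens on a private copy of the batch, the expensive part of each sync pass runs without the lock, which is also what would make parallel processing of the batch possible later.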

@openshift-ci

openshift-ci bot commented Feb 4, 2022

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-cgroupsv2 | af16f13 | link | false | `/test e2e-aws-cgroupsv2` |
| ci/prow/unit | af16f13 | link | true | `/test unit` |
| ci/prow/verify | af16f13 | link | true | `/test verify` |
| ci/prow/e2e-gcp-upgrade | af16f13 | link | true | `/test e2e-gcp-upgrade` |
| ci/prow/e2e-agnostic-cmd | af16f13 | link | false | `/test e2e-agnostic-cmd` |
| ci/prow/e2e-aws-serial | af16f13 | link | true | `/test e2e-aws-serial` |
| ci/prow/verify-commits | af16f13 | link | true | `/test verify-commits` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci

openshift-ci bot commented Feb 17, 2022

@smarterclayton: PR needs rebase.

@openshift-ci bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Feb 17, 2022
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on May 18, 2022
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci bot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Jun 17, 2022
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot closed this on Jul 17, 2022
@openshift-ci

openshift-ci bot commented Jul 17, 2022

@openshift-bot: Closed this PR.

Labels
backports/unvalidated-commits: Indicates that not all commits come to merged upstream PRs.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
vendor-update: Touching vendor dir or related files