DO NOT MERGE: Testing context cancellation and status prioritization together #1163

smarterclayton · 2022-02-01T23:28:55Z

#1161 and #1162 together should significantly reduce the perceived end-to-end latency of pod startup and shutdown.

If CRI returns a container that has been created but is not running, it is not safe to assume it is terminal, as our connection to CRI may have failed. Instead, created is treated as waiting, as in "waiting for this container to start". Either syncPod or syncTerminatingPod is responsible for handling this state.
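The mapping described above can be sketched roughly as follows. This is only an illustration, not the Kubelet's actual code: the `RuntimeState` and `StatusKind` types, their constants, and `toStatusKind` are hypothetical names, and the handling of unrecognized states is an assumption of the sketch.

```go
package main

import "fmt"

// RuntimeState is a hypothetical stand-in for the state a container runtime reports.
type RuntimeState string

const (
	RuntimeCreated RuntimeState = "created"
	RuntimeRunning RuntimeState = "running"
	RuntimeExited  RuntimeState = "exited"
)

// StatusKind is a hypothetical stand-in for the state published in pod status.
type StatusKind string

const (
	Waiting    StatusKind = "Waiting"
	Running    StatusKind = "Running"
	Terminated StatusKind = "Terminated"
)

// toStatusKind maps a runtime state to a status kind. A created container is
// reported as waiting ("waiting for this container to start") rather than as
// terminated, because a created-but-not-running container is not proof that
// the container has finished.
func toStatusKind(s RuntimeState) StatusKind {
	switch s {
	case RuntimeRunning:
		return Running
	case RuntimeExited:
		return Terminated
	case RuntimeCreated:
		return Waiting
	default:
		// Assumption of this sketch: unrecognized states are also reported as waiting.
		return Waiting
	}
}

func main() {
	for _, s := range []RuntimeState{RuntimeCreated, RuntimeRunning, RuntimeExited} {
		fmt.Printf("%s -> %s\n", s, toStatusKind(s))
	}
}
```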

In preparation for allowing `sync*Pod` methods to be cancelled when the pod transitions to terminating, pass context to the appropriate methods in the Kubelet that might need to be cancelled within a deadline or due to user input. Does not change the behavior of those functions. Change interface methods and stored structs for easier review.

In preparation for allowing `sync*Pod` methods to be cancelled when the pod transitions to terminating, pass context to the appropriate methods in the Kubelet that might need to be cancelled within a deadline or due to user input. Does not change the behavior of those functions. Propagate core long-running methods (CRI, GC, streaming) up out of methods towards the top level. Methods with context imply remote invocations of CRI, so the context is propagated up until it hits either a method already carrying a context (such as HTTP servers, or `sync*Pod`, which will perform cancellation), a top-level wait loop, or a boundary with a subsystem that does not clearly deserve context propagation. Top-level loops get `context.Background()` and the rest get `context.TODO()`. This commit contains all such transitions; subsequent PRs only propagate context.
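A minimal sketch of the propagation pattern, under stated assumptions: `runtimeService`, `fakeRuntime`, `killPod`, and `runDemo` are hypothetical names invented for illustration and are not the Kubelet's real interfaces. The point is only that the context is accepted as the first parameter and handed down to the remote call, so a caller that owns a deadline can interrupt it, while a top-level caller passes `context.Background()`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runtimeService is a hypothetical stand-in for a remote runtime client.
type runtimeService interface {
	StopContainer(ctx context.Context, id string) error
}

// fakeRuntime simulates a remote call that takes `delay` to complete
// unless the context is cancelled first.
type fakeRuntime struct{ delay time.Duration }

func (r fakeRuntime) StopContainer(ctx context.Context, id string) error {
	select {
	case <-time.After(r.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// killPod accepts the caller's context and passes it through unchanged,
// so cancellation decided at the top reaches the remote call.
func killPod(ctx context.Context, rt runtimeService, ids []string) error {
	for _, id := range ids {
		if err := rt.StopContainer(ctx, id); err != nil {
			return fmt.Errorf("stop container %s: %w", id, err)
		}
	}
	return nil
}

// runDemo runs one call from a "top-level loop" (context.Background) and one
// call from a caller that enforces a deadline shorter than the remote call.
func runDemo() (backgroundErr error, timedOut bool) {
	backgroundErr = killPod(context.Background(), fakeRuntime{delay: time.Millisecond}, []string{"a"})

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	err := killPod(ctx, fakeRuntime{delay: time.Second}, []string{"b"})
	return backgroundErr, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	bgErr, timedOut := runDemo()
	fmt.Println("background call error:", bgErr)
	fmt.Println("deadline call timed out:", timedOut)
}
```

Callers that have not yet been converted would pass `context.TODO()` in place of `context.Background()`, which marks them as places where a real context should eventually be supplied.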

In preparation for allowing `sync*Pod` methods to be cancelled when the pod transitions to terminating, pass context to the appropriate methods in the Kubelet that might need to be cancelled within a deadline or due to user input. Does not change the behavior of those functions. For the remote CRI service, remove the wrappers that injected a new context and call the direct context equivalents for timeout.
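One possible reading of this change, sketched with hypothetical names (`remoteRuntime`, `Version`, and `blockUntilDone` are illustrative and not the real remote CRI service): instead of a wrapper that builds a fresh context internally, the per-call timeout is derived from the caller's context, so both the timeout and the caller's cancellation apply.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// remoteRuntime is a hypothetical client with a per-call timeout.
// call stands in for the underlying remote invocation.
type remoteRuntime struct {
	timeout time.Duration
	call    func(ctx context.Context) error
}

// Version derives its deadline from the caller's context rather than from
// context.Background(), so cancelling the parent also cancels the call.
func (r *remoteRuntime) Version(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	return r.call(ctx)
}

// blockUntilDone simulates a remote call that never completes on its own.
func blockUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// callerCancelWins reports whether a cancelled parent context stops the call
// even though the per-call timeout is very long.
func callerCancelWins() bool {
	r := &remoteRuntime{timeout: time.Hour, call: blockUntilDone}
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return errors.Is(r.Version(ctx), context.Canceled)
}

// timeoutStillApplies reports whether the per-call timeout still fires when
// the parent context is never cancelled.
func timeoutStillApplies() bool {
	r := &remoteRuntime{timeout: 10 * time.Millisecond, call: blockUntilDone}
	return errors.Is(r.Version(context.Background()), context.DeadlineExceeded)
}

func main() {
	fmt.Println("caller cancellation propagates:", callerCancelWins())
	fmt.Println("per-call timeout still applies:", timeoutStillApplies())
}
```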

In preparation for allowing `sync*Pod` methods to be cancelled when the pod transitions to terminating, pass context to the appropriate methods in the Kubelet that might need to be cancelled within a deadline or due to user input. Does not change the behavior of those functions. Contains all propagation of context upwards when the parent method either now passes context, or context was already present.

In preparation for allowing `sync*Pod` methods to be cancelled when the pod transitions to terminating, pass context to the appropriate methods in the Kubelet that might need to be cancelled within a deadline or due to user input. Does not change the behavior of those functions. Update test methods to pass contexts where changed.

openshift-ci-robot · 2022-02-01T23:29:03Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

3eefaae|kubelet: Pass context down to long running methods (1/5): does not specify an upstream backport in the commit message
5582ff3|kubelet: Streamline Kubelet pod status reporting: does not specify an upstream backport in the commit message
6e1a028|kubelet: Record a metric for latency of pod status update: does not specify an upstream backport in the commit message
8254393|kubelet: Pass context down to long running methods (4/5): does not specify an upstream backport in the commit message
a7c801b|kubelet: If the container status is created, we are waiting: does not specify an upstream backport in the commit message
a8fc39b|kubelet: Prioritize certain pod status updates: does not specify an upstream backport in the commit message
c5822d6|kubelet: Pass context down to long running methods (2/5): does not specify an upstream backport in the commit message
d699e5f|kubelet: Pass context down to long running methods (5/5): does not specify an upstream backport in the commit message
f5ff48b|kubelet: Pass context down to long running methods (3/5): does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci · 2022-02-01T23:29:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smarterclayton
To complete the pull request process, please assign soltysh after the PR has been reviewed.
You can assign the PR to them by writing /assign @soltysh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

DOWNSTREAM_OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2022-02-01T23:49:01Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

3eefaae|kubelet: Pass context down to long running methods (1/5): does not specify an upstream backport in the commit message
655bc69|DO NOT MERGE: Comment out the dockershim: does not specify an upstream backport in the commit message
8254393|kubelet: Pass context down to long running methods (4/5): does not specify an upstream backport in the commit message
a7c801b|kubelet: If the container status is created, we are waiting: does not specify an upstream backport in the commit message
a8b117f|kubelet: Record a metric for latency of pod status update: does not specify an upstream backport in the commit message
c5618c9|kubelet: Prioritize certain pod status updates: does not specify an upstream backport in the commit message
c5822d6|kubelet: Pass context down to long running methods (2/5): does not specify an upstream backport in the commit message
d699e5f|kubelet: Pass context down to long running methods (5/5): does not specify an upstream backport in the commit message
dc51fba|kubelet: Streamline Kubelet pod status reporting: does not specify an upstream backport in the commit message
f5ff48b|kubelet: Pass context down to long running methods (3/5): does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci-robot · 2022-02-02T19:06:45Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

031d339|kubelet: Prioritize certain pod status updates: does not specify an upstream backport in the commit message
33262f0|kubelet: Streamline Kubelet pod status reporting: does not specify an upstream backport in the commit message
368ee3e|kubelet: Record a metric for latency of pod status update: does not specify an upstream backport in the commit message
3eefaae|kubelet: Pass context down to long running methods (1/5): does not specify an upstream backport in the commit message
655bc69|DO NOT MERGE: Comment out the dockershim: does not specify an upstream backport in the commit message
8254393|kubelet: Pass context down to long running methods (4/5): does not specify an upstream backport in the commit message
a7c801b|kubelet: If the container status is created, we are waiting: does not specify an upstream backport in the commit message
c5822d6|kubelet: Pass context down to long running methods (2/5): does not specify an upstream backport in the commit message
d699e5f|kubelet: Pass context down to long running methods (5/5): does not specify an upstream backport in the commit message
f5ff48b|kubelet: Pass context down to long running methods (3/5): does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci-robot · 2022-02-02T21:03:37Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

3eefaae|kubelet: Pass context down to long running methods (1/5): does not specify an upstream backport in the commit message
655bc69|DO NOT MERGE: Comment out the dockershim: does not specify an upstream backport in the commit message
6c7d7b3|kubelet: Prioritize certain pod status updates: does not specify an upstream backport in the commit message
721954a|kubelet: Streamline Kubelet pod status reporting: does not specify an upstream backport in the commit message
8254393|kubelet: Pass context down to long running methods (4/5): does not specify an upstream backport in the commit message
a7c801b|kubelet: If the container status is created, we are waiting: does not specify an upstream backport in the commit message
c5822d6|kubelet: Pass context down to long running methods (2/5): does not specify an upstream backport in the commit message
d699e5f|kubelet: Pass context down to long running methods (5/5): does not specify an upstream backport in the commit message
f5ff48b|kubelet: Pass context down to long running methods (3/5): does not specify an upstream backport in the commit message
fa623be|kubelet: Record a metric for latency of pod status update: does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency.

Streamline the pod status manager to track the set of updated pods instead of using a buffered channel. Reduce the time the pod status lock is held by moving other expensive checks out of the loop, which also opens the door to parallelizing the status queue later. Avoid making some checks twice now that syncPod is only called from syncBatch. Protect apiStatusVersions under the pod status lock as well to prevent accidents.
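The idea of tracking a set of updated pods instead of queueing each update on a buffered channel might look something like the sketch below. `statusTracker`, `MarkUpdated`, and `Drain` are hypothetical names, not the status manager's real API. Repeated updates to the same pod collapse into one entry, and the lock is held only long enough to swap the set out.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// statusTracker is a hypothetical structure holding the set of pod UIDs
// whose status changed since the last sync.
type statusTracker struct {
	mu     sync.Mutex
	dirty  map[string]struct{}
	notify chan struct{} // capacity 1: a wake-up signal, not a queue of updates
}

func newStatusTracker() *statusTracker {
	return &statusTracker{
		dirty:  make(map[string]struct{}),
		notify: make(chan struct{}, 1),
	}
}

// MarkUpdated adds the pod to the set and wakes the sync loop without ever
// blocking, even when many updates arrive before the loop runs.
func (s *statusTracker) MarkUpdated(uid string) {
	s.mu.Lock()
	s.dirty[uid] = struct{}{}
	s.mu.Unlock()

	select {
	case s.notify <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// Drain swaps the set out under the lock and returns the pending UIDs.
// Any expensive per-pod work can then happen without holding the lock.
func (s *statusTracker) Drain() []string {
	s.mu.Lock()
	batch := s.dirty
	s.dirty = make(map[string]struct{})
	s.mu.Unlock()

	uids := make([]string, 0, len(batch))
	for uid := range batch {
		uids = append(uids, uid)
	}
	sort.Strings(uids) // sorted only to make the output deterministic
	return uids
}

func main() {
	tr := newStatusTracker()
	tr.MarkUpdated("pod-a")
	tr.MarkUpdated("pod-a") // collapses into the existing entry
	tr.MarkUpdated("pod-b")

	<-tr.notify // one wake-up covers all three calls
	fmt.Println(tr.Drain())
	fmt.Println(len(tr.Drain()))
}
```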

Some pod status transitions directly impact end-to-end user latency in the Kubelet, such as pods going ready, going unready, or becoming Succeeded or Failed. Prioritize the order that pods are updated in to minimize that latency.
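A simple way to picture the prioritization is a stable sort that moves latency-sensitive updates to the front of the batch. The `update` type, its fields, and the two-tier ordering are assumptions made for this sketch. The commit message names which transitions matter but does not specify how they are ranked against each other.

```go
package main

import (
	"fmt"
	"sort"
)

// update is a hypothetical description of a pending pod status update.
type update struct {
	UID            string
	ReadyChanged   bool // the pod went ready or unready
	BecameTerminal bool // the pod became Succeeded or Failed
}

// isUrgent reports whether the update is one of the transitions that
// directly affect end-to-end user latency.
func isUrgent(u update) bool {
	return u.ReadyChanged || u.BecameTerminal
}

// prioritize returns a copy of the batch with urgent updates first.
// The sort is stable, so arrival order is preserved within each tier.
func prioritize(updates []update) []update {
	out := append([]update(nil), updates...)
	sort.SliceStable(out, func(i, j int) bool {
		return isUrgent(out[i]) && !isUrgent(out[j])
	})
	return out
}

func main() {
	batch := []update{
		{UID: "a"},
		{UID: "b", ReadyChanged: true},
		{UID: "c"},
		{UID: "d", BecameTerminal: true},
	}
	for _, u := range prioritize(batch) {
		fmt.Println(u.UID)
	}
}
```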

openshift-ci-robot · 2022-02-03T03:20:14Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

2b1d41f|kubelet: Streamline Kubelet pod status reporting: does not specify an upstream backport in the commit message
3eefaae|kubelet: Pass context down to long running methods (1/5): does not specify an upstream backport in the commit message
655bc69|DO NOT MERGE: Comment out the dockershim: does not specify an upstream backport in the commit message
6bb6cb3|kubelet: Prioritize certain pod status updates: does not specify an upstream backport in the commit message
8254393|kubelet: Pass context down to long running methods (4/5): does not specify an upstream backport in the commit message
a7c801b|kubelet: If the container status is created, we are waiting: does not specify an upstream backport in the commit message
b7d8ccc|kubelet: Record a metric for latency of pod status update: does not specify an upstream backport in the commit message
c5822d6|kubelet: Pass context down to long running methods (2/5): does not specify an upstream backport in the commit message
d699e5f|kubelet: Pass context down to long running methods (5/5): does not specify an upstream backport in the commit message
f5ff48b|kubelet: Pass context down to long running methods (3/5): does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci-robot · 2022-02-04T20:14:36Z

@smarterclayton: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

2b1d41f|kubelet: Streamline Kubelet pod status reporting: does not specify an upstream backport in the commit message
3eefaae|kubelet: Pass context down to long running methods (1/5): does not specify an upstream backport in the commit message
655bc69|DO NOT MERGE: Comment out the dockershim: does not specify an upstream backport in the commit message
6bb6cb3|kubelet: Prioritize certain pod status updates: does not specify an upstream backport in the commit message
8254393|kubelet: Pass context down to long running methods (4/5): does not specify an upstream backport in the commit message
a7c801b|kubelet: If the container status is created, we are waiting: does not specify an upstream backport in the commit message
af16f13|Add additional metric tracking timing post-prioritization: does not specify an upstream backport in the commit message
b7d8ccc|kubelet: Record a metric for latency of pod status update: does not specify an upstream backport in the commit message
c5822d6|kubelet: Pass context down to long running methods (2/5): does not specify an upstream backport in the commit message
d699e5f|kubelet: Pass context down to long running methods (5/5): does not specify an upstream backport in the commit message
f5ff48b|kubelet: Pass context down to long running methods (3/5): does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci · 2022-02-04T22:35:25Z

@smarterclayton: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-cgroupsv2 | `af16f13` | link | false | `/test e2e-aws-cgroupsv2` |
| ci/prow/unit | `af16f13` | link | true | `/test unit` |
| ci/prow/verify | `af16f13` | link | true | `/test verify` |
| ci/prow/e2e-gcp-upgrade | `af16f13` | link | true | `/test e2e-gcp-upgrade` |
| ci/prow/e2e-agnostic-cmd | `af16f13` | link | false | `/test e2e-agnostic-cmd` |
| ci/prow/e2e-aws-serial | `af16f13` | link | true | `/test e2e-aws-serial` |
| ci/prow/verify-commits | `af16f13` | link | true | `/test verify-commits` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci · 2022-02-17T02:31:18Z

@smarterclayton: PR needs rebase.

openshift-bot · 2022-05-18T12:21:03Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2022-06-17T12:49:00Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2022-07-17T13:11:27Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2022-07-17T13:11:52Z

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

smarterclayton added 6 commits February 1, 2022 18:22

openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Feb 1, 2022

openshift-ci bot requested review from deads2k and rphillips February 1, 2022 23:29

openshift-ci bot added the vendor-update Touching vendor dir or related files label Feb 1, 2022

DO NOT MERGE: Comment out the dockershim

655bc69

None of the refactors touch it as it is deleted upstream

smarterclayton force-pushed the context_cancel_status_down branch from a8fc39b to c5618c9 Compare February 1, 2022 23:48

smarterclayton force-pushed the context_cancel_status_down branch from c5618c9 to 031d339 Compare February 2, 2022 19:06

smarterclayton force-pushed the context_cancel_status_down branch from 031d339 to 6c7d7b3 Compare February 2, 2022 21:03

smarterclayton added 3 commits February 2, 2022 22:19

kubelet: Record a metric for latency of pod status update

b7d8ccc

Track how long it takes for pod updates to propagate from detection to successful change on API server. Will guide future improvements in pod start and shutdown latency.

kubelet: Prioritize certain pod status updates

6bb6cb3

Some pod status transitions directly impact end-to-end user latency in the Kubelet, such as pods going ready, going unready, or becoming Succeeded or Failed. Prioritize the order that pods are updated in to minimize that latency.

smarterclayton force-pushed the context_cancel_status_down branch from 6c7d7b3 to 6bb6cb3 Compare February 3, 2022 03:20

Add additional metric tracking timing post-prioritization

af16f13

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2022

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2022

openshift-ci bot closed this Jul 17, 2022
