@wking
Member

@wking wking commented Jan 2, 2020

Forward-porting #288 to 4.3. Although 4.3 kubelet capacity reporting works, we still need to drop the 4.3 request, to support flows like:

  1. 4.2 cluster running with 4.2 CVO and 4.2 kubelets (so no capacity reporting).
  2. Admin requests an update to 4.3.1.
  3. 4.2 CVO launches a version pod without requests, because of the 4.2 reversion (Bug 1786315: pkg/cvo/updatepayload: Drop ephemeral-storage request #288). This works fine.
  4. Update gets far enough to run a 4.3 CVO.
  5. Update hangs on some 4.3.1 bug, while it's still running 4.2 kubelets.
  6. Admin requests an update to 4.3.2.
  7. 4.3 CVO launches a version pod with an ephemeral-storage request, which hangs because the 4.2 kubelets are still running and not reporting ephemeral-storage capacity.

I don't know why, but ephemeral-storage capacity reporting is not
working in 4.2, leading to version pods dying with [1]:

  Node didn't have enough resource: ephemeral-storage, requested: 2097152, used: 0, capacity: 0
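
The message above reads like a plain fit check: the pod is rejected when its request (2097152 bytes, i.e. 2 MiB) exceeds the node's reported capacity minus what is already in use. A minimal Python sketch of that arithmetic follows. It is a simplified model, not the kubelet's actual code, and the function name `check_fit` is hypothetical:

```python
from typing import Optional


def check_fit(resource: str, requested: int, used: int, capacity: int) -> Optional[str]:
    """Return a rejection message if the request does not fit, else None.

    Simplified model: the request fits when requested <= capacity - used.
    """
    available = capacity - used
    if requested > available:
        return (
            f"Node didn't have enough resource: {resource}, "
            f"requested: {requested}, used: {used}, capacity: {capacity}"
        )
    return None


# A node that reports no ephemeral-storage capacity is modelled as capacity 0,
# matching the "capacity: 0" in the error above.
print(check_fit("ephemeral-storage", 2 * 1024 * 1024, 0, 0))

# A node reporting 124768236Ki has plenty of room for a 2 MiB request.
print(check_fit("ephemeral-storage", 2 * 1024 * 1024, 0, 124768236 * 1024))
```

With zero reported capacity, any non-zero request fails this check, which is why dropping the request avoids the problem regardless of how much disk the node really has.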

For an example of a 4.2 cluster without ephemeral-storage capacity
reporting, see this 4.2.10 -> 4.2.12 update test [2]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12620/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a0dbe73b7831a8ddb9a2c58a560461d7c2c23a92231289a2104b93e7723c0eff/cluster-scoped-resources/core/nodes/ip-10-0-129-58.ec2.internal.yaml | yaml2json | jq .status.capacity | json2yaml
  attachable-volumes-aws-ebs: '39'
  cpu: '4'
  hugepages-1Gi: '0'
  hugepages-2Mi: '0'
  memory: 16419384Ki
  pods: '250'

Capacity reporting is working in 4.3; e.g. see this
4.2.12 -> 4.3.0-0.nightly-2020-01-02-141332 update test [3]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13437/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-c6c63e67c3d38a704c8695a40bb64b9975df2bda3f00c9379592cd5596126f2d/cluster-scoped-resources/core/nodes/ip-10-0-130-241.ec2.internal.yaml | yaml2json | jq .status.capacity | json2yaml
  attachable-volumes-aws-ebs: '39'
  cpu: '4'
  ephemeral-storage: 124768236Ki
  hugepages-1Gi: '0'
  hugepages-2Mi: '0'
  memory: 16419384Ki
  pods: '250'
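
The difference between the two capacity listings can also be checked programmatically. The sketch below is only an illustration: it copies the two `status.capacity` maps from the output above into Python dicts and converts the `ephemeral-storage` quantity to bytes. The helper names are hypothetical, only binary suffixes are handled, and treating a missing key as 0 is an assumption chosen to match the `capacity: 0` in the error message:

```python
# Binary suffixes used by Kubernetes resource quantities (subset only).
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def quantity_to_int(quantity: str) -> int:
    """Convert a quantity such as '124768236Ki' or '4' to a plain integer."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def ephemeral_storage_bytes(capacity: dict) -> int:
    """Reported ephemeral-storage capacity in bytes; a missing key counts as 0."""
    return quantity_to_int(capacity.get("ephemeral-storage", "0"))


# status.capacity from the 4.2 node above (no ephemeral-storage key).
capacity_42 = {
    "attachable-volumes-aws-ebs": "39",
    "cpu": "4",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16419384Ki",
    "pods": "250",
}

# status.capacity from the 4.3 node above.
capacity_43 = dict(capacity_42, **{"ephemeral-storage": "124768236Ki"})

print(ephemeral_storage_bytes(capacity_42))
print(ephemeral_storage_bytes(capacity_43))
```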

Although 4.3 kubelet capacity reporting works, we still need to drop
the 4.3 request, to support flows like:

1. 4.2 cluster running with 4.2 CVO and 4.2 kubelets (so no capacity
   reporting).
2. Admin requests an update to 4.3.1.
3. 4.2 CVO launches a version pod without requests, because of the 4.2
   reversion [4].  This works fine.
4. Update gets far enough to run a 4.3 CVO.
5. Update hangs on some 4.3.1 bug, while it's still running 4.2
   kubelets.
6. Admin requests an update to 4.3.2.
7. 4.3 CVO launches a version pod with an ephemeral-storage request,
   which hangs because the 4.2 kubelets are still running and not
   reporting ephemeral-storage capacity.
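
The flow above can be modelled as a small truth table over which CVO launches the version pod and which kubelets are running. This is a hedged sketch, not CVO code: the `Cluster` type, the function names, and the `request_dropped_in_43` flag are all hypothetical, and the model only covers the 4.2 and 4.3 behaviour described in this PR:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    cvo_minor: str      # minor version of the running CVO, e.g. "4.2"
    kubelet_minor: str  # minor version of the running kubelets


def cvo_sets_request(cvo_minor: str, request_dropped_in_43: bool) -> bool:
    """Does this CVO put an ephemeral-storage request on the version pod?"""
    if cvo_minor == "4.2":
        return False  # the request was already reverted in 4.2 by #288
    return not request_dropped_in_43


def kubelet_reports_capacity(kubelet_minor: str) -> bool:
    """4.2 kubelets do not report ephemeral-storage capacity; 4.3 kubelets do."""
    return kubelet_minor != "4.2"


def version_pod_runs(cluster: Cluster, request_dropped_in_43: bool) -> bool:
    """A version pod runs if it carries no request or the node reports capacity."""
    if not cvo_sets_request(cluster.cvo_minor, request_dropped_in_43):
        return True
    return kubelet_reports_capacity(cluster.kubelet_minor)


start = Cluster(cvo_minor="4.2", kubelet_minor="4.2")  # steps 1-3
stuck = Cluster(cvo_minor="4.3", kubelet_minor="4.2")  # steps 4-7

print(version_pod_runs(start, request_dropped_in_43=False))  # True
print(version_pod_runs(stuck, request_dropped_in_43=False))  # False: step 7 hangs
print(version_pod_runs(stuck, request_dropped_in_43=True))   # True: with this change
```

The only failing combination is a 4.3 CVO that still sets the request on top of 4.2 kubelets, which is exactly the partially-updated state in steps 5 through 7.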

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1786315
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12620
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13437
[4]: openshift#288
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) labels Jan 2, 2020
@wking wking changed the base branch from master to release-4.3 January 2, 2020 19:09
@openshift-ci-robot
Contributor

@wking: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

pkg/cvo/updatepayload: Drop ephemeral-storage request

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) label and removed the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) label Jan 2, 2020
@wking changed the title from "pkg/cvo/updatepayload: Drop ephemeral-storage request" to "Bug 1787422: pkg/cvo/updatepayload: Drop ephemeral-storage request" Jan 2, 2020
@openshift-ci-robot added the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) label Jan 2, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1787422, which is invalid:

  • expected dependent Bugzilla bug 1786315 to be in one of the following states: MODIFIED, ON_QA, VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is NEW instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1787422: pkg/cvo/updatepayload: Drop ephemeral-storage request

@wking
Member Author

wking commented Jan 2, 2020

/bugzilla refresh

@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1787422, which is invalid:

  • expected Bugzilla bug 1787422 to depend on a bug in one of the following states: MODIFIED, ON_QA, VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.
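
The two "invalid" messages in this thread suggest a dependency check along the following lines. This Python sketch is only a model reconstructed from the message text, not the bot's implementation. The function name `validate_dependents` and the shape of its input (a map from dependent bug ID to state) are assumptions:

```python
from typing import Dict, List

# States listed in the bot's messages above.
VALID_DEPENDENT_STATES = [
    "MODIFIED",
    "ON_QA",
    "VERIFIED",
    "RELEASE_PENDING",
    "CLOSED (ERRATA)",
]


def validate_dependents(bug_id: int, dependents: Dict[int, str]) -> List[str]:
    """Return the reasons a bug is invalid; an empty list means it passes."""
    states = ", ".join(VALID_DEPENDENT_STATES)
    if not dependents:
        return [
            f"expected Bugzilla bug {bug_id} to depend on a bug in one of the "
            f"following states: {states}, but no dependents were found"
        ]
    errors = []
    for dep_id, state in dependents.items():
        if state not in VALID_DEPENDENT_STATES:
            errors.append(
                f"expected dependent Bugzilla bug {dep_id} to be in one of the "
                f"following states: {states}, but it is {state} instead"
            )
    return errors


# First message in the thread: the dependent bug exists but is still NEW.
print(validate_dependents(1787422, {1786315: "NEW"}))

# Second message: no dependents were found at all.
print(validate_dependents(1787422, {}))
```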

In response to this:

/bugzilla refresh

@wking
Member Author

wking commented Jan 2, 2020

/bugzilla refresh
/retest

@openshift-ci-robot added the bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) label and removed the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) label Jan 2, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1787422, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh
/retest

@wking
Member Author

wking commented Jan 2, 2020

I think CI is hung up on my changing the PR base from master to release-4.3. Project name from here. Cleared the project with:

oc delete project ci-op-n14zrm00

Trying again:

/retest

@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name                Commit   Details  Rerun command
ci/prow/integration      4e1485f  link     /test integration
ci/prow/e2e-aws          4e1485f  link     /test e2e-aws
ci/prow/e2e-aws-upgrade  4e1485f  link     /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@wking
Member Author

wking commented Jan 2, 2020

Looks like CI cannot recover. Opening a replacement PR...

/close

@openshift-ci-robot
Contributor

@wking: Closed this PR.

In response to this:

Looks like CI cannot recover. Opening a replacement PR...

/close
