ci-operator/step-registry/ipi/conf/aws: Default to m5.xlarge COMPUTE_NODE_TYPE #19195

wking · 2021-06-12T21:00:54Z

Bumping from m4, to which we've defaulted since a8426c0 (#15923). For more on m4 vs. m5 in our AWS CI zones, see rhbz#1713157. From the m5 docs, m5.xlarge has 4 vCPU and 16 GiB memory, just like m4.xlarge. It should also bump our EBS bandwidth from m4.xlarge's "dedicated 750 Mbps" to m5.xlarge's "up to 4,750 Mbps".

This should avoid failures like:

alert MachineWithNoRunningPhase fired for 3523 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.77:8443", job="machine-api-operator", name="ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb", namespace="openshift-machine-api", phase="Failed", pod="machine-api-operator-55dd6d8d9d-gh5xw", service="machine-api-operator", severity="warning"}
alert MachineWithoutValidNode fired for 3474 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.77:8443", job="machine-api-operator", name="ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb", namespace="openshift-machine-api", phase="Failed", pod="machine-api-operator-55dd6d8d9d-gh5xw", service="machine-api-operator", severity="warning"}

which is from picking a zone that lacks m4 support:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1403576531913019392/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/machines.json | jq -r '.items[] | select(.metadata.name == "ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb").status.errorMessage'
ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb: reconciler failed to Create machine: failed to launch instance: error launching instance: Your requested instance type (m4.xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c.
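
The jq filter above pulls one machine's error by name. As a rough sketch, the same check can be generalized to list every machine that reports an error and to pick the instance type, rejected zone, and suggested zones out of the AWS message. The function names, the regular expression, and the second sample machine are hypothetical illustrations, not part of this PR; only the metadata.name and status.errorMessage fields used in the jq filter are assumed.

```python
import json
import re

# Sample shaped like the machines.json artifact. The first entry reuses the
# name and message quoted above; the second entry is a hypothetical healthy
# machine added only for contrast.
SAMPLE_JSON = """
{
  "items": [
    {
      "metadata": {"name": "ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb"},
      "status": {
        "phase": "Failed",
        "errorMessage": "ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb: reconciler failed to Create machine: failed to launch instance: error launching instance: Your requested instance type (m4.xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c."
      }
    },
    {
      "metadata": {"name": "hypothetical-worker-us-west-2a-abcde"},
      "status": {"phase": "Running"}
    }
  ]
}
"""

# Matches the "not supported in your requested Availability Zone" message
# quoted above. The pattern is an assumption based on that single sample.
ZONE_ERROR = re.compile(
    r"instance type \((?P<type>[^)]+)\) is not supported in your requested "
    r"Availability Zone \((?P<zone>[^)]+)\)\. .* or choosing (?P<alternatives>.+?)\.$"
)


def failed_machines(machines):
    """Return {machine name: error message} for items that report an error."""
    return {
        item["metadata"]["name"]: item["status"]["errorMessage"]
        for item in machines.get("items", [])
        if item.get("status", {}).get("errorMessage")
    }


def parse_zone_error(message):
    """Extract instance type, rejected zone, and suggested zones, or None."""
    match = ZONE_ERROR.search(message)
    if match is None:
        return None
    return {
        "instance_type": match["type"],
        "zone": match["zone"],
        "alternatives": [z.strip() for z in match["alternatives"].split(",")],
    }


if __name__ == "__main__":
    for name, message in failed_machines(json.loads(SAMPLE_JSON)).items():
        print(name, parse_zone_error(message))
```

In practice the JSON would come from the gather-extra machines.json artifact rather than an inline string.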

…NODE_TYPE

Bumping from m4, to which we've defaulted since a8426c0 (steps: Update cloud (AWS/GCP) to match Azure worker size of 4 core, 2021-02-16, openshift#15923). For more on m4 vs. m5 in our AWS CI zones, see [1]. From [2], m5.xlarge has 4 vCPU and 16 GiB memory, just like m4.xlarge [3]. It should also bump our EBS bandwidth from m4.xlarge's "dedicated 750 Mbps" [3] to m5.xlarge's "up to 4,750 Mbps".

This should avoid failures like [4]:

alert MachineWithNoRunningPhase fired for 3523 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.77:8443", job="machine-api-operator", name="ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb", namespace="openshift-machine-api", phase="Failed", pod="machine-api-operator-55dd6d8d9d-gh5xw", service="machine-api-operator", severity="warning"}
alert MachineWithoutValidNode fired for 3474 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.77:8443", job="machine-api-operator", name="ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb", namespace="openshift-machine-api", phase="Failed", pod="machine-api-operator-55dd6d8d9d-gh5xw", service="machine-api-operator", severity="warning"}

which is from picking a zone that lacks m4 support:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1403576531913019392/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/machines.json | jq -r '.items[] | select(.metadata.name == "ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb").status.errorMessage'
ci-op-2bslq277-8d118-ldpvh-worker-us-west-2d-9jlgb: reconciler failed to Create machine: failed to launch instance: error launching instance: Your requested instance type (m4.xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1713157#c1
[2]: https://aws.amazon.com/ec2/instance-types/m5/
[3]: https://aws.amazon.com/ec2/instance-types/
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1403576531913019392

openshift-ci · 2021-06-13T01:32:33Z

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/rehearse/redhat-developer/jenkins-operator/main/e2e	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/ovn-kubernetes/release-4.9/e2e-ovn-hybrid-step-registry	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/origin/release-4.1/e2e-aws-builds	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/origin/release-4.1/e2e-aws-image-ecosystem	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/windows-machine-config-operator/release-4.9/aws-e2e-upgrade	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/origin/release-4.2/e2e-cmd	`072d246`	link	`/test pj-rehearse`
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/cloud-credential-operator/release-4.9/e2e-aws-manual-oidc	`072d246`	link	`/test pj-rehearse`
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-workers-rhel7	`072d246`	link	`/test pj-rehearse`
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node	`072d246`	link	`/test pj-rehearse`
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/origin/release-4.9/e2e-aws-disruptive	`072d246`	link	`/test pj-rehearse`
ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-upgrade	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/installer/release-4.9/e2e-aws-upgrade	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/windows-machine-config-operator/release-4.9/aws-e2e-operator	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/cluster-cloud-controller-manager-operator/release-4.9/e2e-aws-ccm	`072d246`	link	`/test pj-rehearse`
ci/rehearse/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade	`072d246`	link	`/test pj-rehearse`
ci/prow/pj-rehearse	`072d246`	link	`/test pj-rehearse`
ci/rehearse/openshift/kubernetes/release-4.9/configmap-scale	`072d246`	link	`/test pj-rehearse`


openshift-ci · 2021-06-14T15:03:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking


Needs approval from an approver in each of these files:

~~ci-operator/step-registry/ipi/OWNERS~~ [petr-muller,wking]


openshift-bot · 2021-06-14T15:16:09Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-06-14T15:24:27Z

@wking: Updated the step-registry configmap in namespace ci at cluster app.ci using the following files:

key ipi-conf-aws-ref.yaml using file ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-ref.yaml


openshift-ci bot requested review from crawford and ewolinetz June 12, 2021 21:01

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jun 12, 2021

petr-muller approved these changes Jun 14, 2021


openshift-ci bot assigned petr-muller Jun 14, 2021

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Jun 14, 2021

openshift-merge-robot merged commit aec409a into openshift:master Jun 14, 2021

wking deleted the default-aws-to-m5 branch June 14, 2021 18:57
