Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci-operator/step-registry/ipi/conf/azure: Get region from Boskos lease #12584

Merged

Conversation

wking
Copy link
Member

@wking wking commented Oct 8, 2020

To help avoid errors like "we randomly assigned more Azure centralus clusters than we had capacity for and they died on quota limits". Following up on #11752.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2020
@wking
Copy link
Member Author

wking commented Oct 8, 2020

Is LEASED_RESOURCE available for template-based jobs too? We have similar code here and in other templates, which it would be nice to drive via the lease name.

@abhinavdahiya
Copy link
Contributor

Is LEASED_RESOURCE available for template-based jobs too? We have similar code here and in other templates, which it would be nice to drive via the lease name.

Based on https://steps.ci.openshift.org/help/leases#static

A test may access the name of the resource that was acquired using the ${LEASED_RESOURCE} environment variable. these names of the region should be available.

@abhinavdahiya
Copy link
Contributor

abhinavdahiya commented Oct 9, 2020

But I think the lease update and the use of these might have be in separate PRs because the rehearsals might be using existing boskos setup..
as seen in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/12584/rehearse-12584-pull-ci-openshift-cloud-credential-operator-master-e2e-azure/1314353153147146240#1:build-log.txt%3A50

level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.azure.region: Invalid value: \"34da1949-9e79-45b7-b17a-6f75cb724e9c\": region \"34da1949-9e79-45b7-b17a-6f75cb724e9c\" is not valid or not available for this account"

@openshift-ci-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/openshift/cluster-network-operator/master/e2e-azure-ovn 692452cc14ffba00c6ce8c3d43f42f65c10025c8 link /test pj-rehearse
ci/rehearse/openshift/cloud-credential-operator/master/e2e-azure 692452cc14ffba00c6ce8c3d43f42f65c10025c8 link /test pj-rehearse
ci/rehearse/kubevirt/hyperconverged-cluster-operator/release-1.2/hco-e2e-image-index-azure 692452cc14ffba00c6ce8c3d43f42f65c10025c8 link /test pj-rehearse
ci/rehearse/kubevirt/hyperconverged-cluster-operator/master/hco-e2e-image-index-azure 692452cc14ffba00c6ce8c3d43f42f65c10025c8 link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-azure-shared-vpc 692452cc14ffba00c6ce8c3d43f42f65c10025c8 link /test pj-rehearse
ci/prow/pj-rehearse 692452cc14ffba00c6ce8c3d43f42f65c10025c8 link /test pj-rehearse

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member Author

wking commented Oct 9, 2020

...these names of the region should be available.

Do I have to plumb it through into the template variables? Or does something look at template pods and then inject these variables there?

@wking
Copy link
Member Author

wking commented Oct 9, 2020

But I think the lease update and the use of these might have be in separate PRs because the rehearsals might be using existing boskos setup...

Ah, right. I've filed #12589 with just number -> region-name bumps.

wking added a commit to wking/openshift-release that referenced this pull request Oct 14, 2020
…gion

With some guessing about what we support.  No consumers yet, but
consumer rehearsals run against the production Boskos server [1], so
we need to land these before we can consume them.

[1]: openshift#12584 (comment)
@wking wking force-pushed the per-region-azure-leases branch 2 times, most recently from f2be751 to 414d098 Compare October 15, 2020 02:31
@wking
Copy link
Member Author

wking commented Oct 15, 2020

Rebased onto master with 692452cc14 -> 414d0981af, now that #12589 has landed and updated the production Boskos server.

Copy link
Contributor

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold
Release at will

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2020
@wking
Copy link
Member Author

wking commented Oct 15, 2020

cloud-credential-operator-master-e2e-azure:

level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.azure.region: Invalid value: \"ba6c74e5-e0be-4b19-b38c-8431ce30cac3\": region \"ba6c74e5-e0be-4b19-b38c-8431ce30cac3\" is not valid or not available for this account"

So I need to do something to get the leased name out of that lease ID...

@wking wking changed the title core-services/prow/02_config/_boskos.yaml: Shard Azure by region ci-operator/step-registry/ipi/conf/azure: Get region from Boskos lease Oct 15, 2020
To help avoid errors like "we randomly assigned more Azure centralus
clusters than we had capacity for and they died on quota limits".
From [1]:

> A test may access the name of the resource that was acquired using
> the ${LEASED_RESOURCE} environment variable.

[1]: https://steps.ci.openshift.org/help/leases#static
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2020
@wking
Copy link
Member Author

wking commented Oct 15, 2020

Same, so not a "Boskos config was slow to go live" thing:

level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.azure.region: Invalid value: \"b3c54863-cf4c-4383-bfc7-3a3152242481\": region \"b3c54863-cf4c-4383-bfc7-3a3152242481\" is not valid or not available for this account"

@alvaroaleman
Copy link
Contributor

/uncc
/unassign
as this creates quite a bit of noise during testing :)

@wking
Copy link
Member Author

wking commented Dec 10, 2020

/retest

@wking
Copy link
Member Author

wking commented Dec 10, 2020

Oops, I need #14262 to go live before these rehearsals will pass...

*) echo >&2 "invalid Azure region index"; exit 1;;
esac
echo "Azure region: ${AZURE_REGION}"
REGION="${LEASED_RESOURCE}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the leased resource is something like <region>--<some number> as in https://github.com/openshift/release/pull/14262/files#diff-5169f2a74d1497f38a44e9adc57f6993269a89c3ddf90ab01f5d1d114ef61e58R210

i think we need to transform the lease to region here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need, we split at the -- in ci-operator since openshift/ci-tools#1306.

@wking
Copy link
Member Author

wking commented Dec 10, 2020

/retest

@wking
Copy link
Member Author

wking commented Dec 10, 2020

network operator e2e-azure-ovn:

2020/12/10 18:25:32 Acquiring leases for "e2e-azure-ovn"
2020/12/10 18:25:32 Acquiring 1 lease(s) for "azure4-quota-slice"
2020/12/10 18:25:32 Acquired lease(s) [eastus1--1] for "azure4-quota-slice"
...
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": [platform.azure.region: Invalid value: "eastus1": region "eastus1" is not valid or not available for this account, compute[0].platform.azure.type: Invalid value: "Standard_D4s_v3": not found in region eastus1]

Dunno what that... instance-type-check?... failure is about, but looks like the lease process is working :).

@wking
Copy link
Member Author

wking commented Dec 10, 2020

installer e2e-azure-shared-vpc:

2020/12/10 18:32:17 Acquired lease(s) [eastus2--0] for "azure4-quota-slice"
...
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": platform.azure.computeSubnet: Invalid value: "subnet-2": failed to retrieve compute subnet

Another "lease->region seems ok, but then the installer got confused, and I'm not sure why".

@wking
Copy link
Member Author

wking commented Dec 10, 2020

hco-e2e-image-index-azure:

level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.azure.region: Invalid value: \"eastus1\": region \"eastus1\" is not valid or not available for this account"

@openshift-merge-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/openshift/cluster-network-operator/master/e2e-azure-ovn 7aa198b link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-azure-shared-vpc 7aa198b link /test pj-rehearse
ci/rehearse/kubevirt/hyperconverged-cluster-operator/release-1.2/hco-e2e-image-index-azure 7aa198b link /test pj-rehearse
ci/rehearse/openshift/cluster-api-provider-azure/master/e2e-upgrade 7aa198b link /test pj-rehearse
ci/prow/pj-rehearse 7aa198b link /test pj-rehearse

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member Author

wking commented Dec 11, 2020

Azure provider e2e-upgrade mostly passed, just failed some of the e2e test-cases (orthogonal to this PR). Also got a westus--7 lease, and confirming that's what the cluster is using:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/12584/rehearse-12584-pull-ci-openshift-cluster-api-provider-azure-master-e2e-upgrade/1337098998833483776/artifacts/e2e-upgrade/gather-extra/machinesets.json | jq -r '.items[].spec.template.spec.providerSpec.value.location'
westus

So I think this is solid enough to land.

/hold cancel

Anyone comfortable dropping a :lgtm:? @alvaroaleman , you'd approved this way back?

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 11, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 3320882 into openshift:master Dec 11, 2020
@openshift-ci-robot
Copy link
Contributor

@wking: Updated the following 2 configmaps:

  • step-registry configmap in namespace ci at cluster api.ci using the following files:
    • key ipi-conf-azure-commands.sh using file ci-operator/step-registry/ipi/conf/azure/ipi-conf-azure-commands.sh
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key ipi-conf-azure-commands.sh using file ci-operator/step-registry/ipi/conf/azure/ipi-conf-azure-commands.sh

In response to this:

To help avoid errors like "we randomly assigned more Azure centralus clusters than we had capacity for and they died on quota limits". Following up on #11752.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the per-region-azure-leases branch December 11, 2020 16:55
wking added a commit to wking/openshift-release that referenced this pull request Dec 11, 2020
Maybe this is the wrong name?  We're seeing [1]:

  level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
  level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": [platform.azure.region: Invalid value: "eastus1": region "eastus1" is not valid or not available for this account, compute[0].platform.azure.type: Invalid value: "Standard_D4s_v3": not found in region eastus1]

but only from eastus1.  And before 7aa198b
(ci-operator/step-registry/ipi/conf/azure: Region from Boskos lease,
2020-10-14, openshift#12584), which landed last night, we weren't using
useast1.

[1]: openshift#12584 (comment)
wking added a commit to wking/ci-tools that referenced this pull request Dec 11, 2020
Like openshift/release@7aa198b3c7
(ci-operator/step-registry/ipi/conf/azure: Get region from Boskos
lease, 2020-10-14, openshift/release#12584), but for AWS and GCP here
too, now that we have evidence that the approach is working.

I'm keeping a switch for AWS to give folks a pattern for selecting
zones, if AWS breaks a zone in a particular region.  We should
probably distribute that (and the shared subnets, for shared-subnet
tests?) via leases as well, but baby steps.
wking added a commit to wking/openshift-release that referenced this pull request Dec 14, 2020
Like 7aa198b (ci-operator/step-registry/ipi/conf/azure: Get region
from Boskos lease, 2020-10-14, openshift#12584), but for AWS.  This sets us up
for sharding GCP by region, if we ever need that (e.g. GCP has an
outage in one region).

I'm leaving ci-operator/templates alone; hopefully those will be gone
soon.  I've already updated ci-tools with
openshift/ci-tools@00ebab17e1 (pkg/steps/clusterinstall/template: Get
region from Boskos lease, 2020-12-11, openshift/ci-tools#1527).

I still wish the end-to-end suite pulled the region out of the cluster
itself, but until it does, lean on the Infrastructure status [1] like
we've been doing for AWS since bf0a271 (AWS e2e provider should
identify zone and master and multizone, 2019-01-05, openshift#2507).

[1]: https://github.com/openshift/api/blob/164a2fb63b5f12918c439a5a0a768aa911bcad99/config/v1/types_infrastructure.go#L327-L328
wking added a commit to wking/openshift-release that referenced this pull request Dec 15, 2020
Like 7aa198b (ci-operator/step-registry/ipi/conf/azure: Get region
from Boskos lease, 2020-10-14, openshift#12584), but for AWS.

I'm keeping a switch for AWS to give folks a pattern for selecting
zones, if AWS breaks a zone in a particular region.  We should
probably distribute that (and the shared subnets, for shared-subnet
tests?) via leases as well, but baby steps.

I'm leaving ci-operator/templates alone; hopefully those will be gone
soon.  I've already updated ci-tools with
openshift/ci-tools@00ebab17e1 (pkg/steps/clusterinstall/template: Get
region from Boskos lease, 2020-12-11, openshift/ci-tools#1527).

I'm also normalizing to uppercase shell variables, now that we are no
longer constrained by Go template expansion.  Hmm, at least that's why
I thought the variables used to be lowercase, see 43e08e7
(ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e:
Push AWS-specific default base domain down into the template,
2019-09-23, openshift#5151).  But looking at the templates when de3de20
(step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708),
I'm now not sure why these step commands were using lowercase
variable names.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
5 participants