
pkg/asset/machines/aws: Only return available zones #1210

Merged
merged 1 commit into openshift:master from wking:aws-available-zones
Mar 6, 2019

Conversation

@wking (Member) commented Feb 8, 2019

Zones can also be impaired, unavailable, or "information" (see the AWS and Terraform docs, [1] and [2] below). I dunno what that last one is about. But we only want to put machines in available zones ;).

This is a bit racy, because we could get different responses at both call-sites. Moving forward, we'll want to shift this request into a single available-zones asset.

Fixes #1209.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeAvailabilityZones.html
[2]: https://www.terraform.io/docs/providers/aws/d/availability_zones.html#state
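
For reference, here is a minimal Go sketch of the kind of state filtering described above, assuming aws-sdk-go v1; the `availableZones` helper name, the standalone `main`, and the server-side `state` filter are illustrative rather than the PR's exact diff:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// availableZones asks EC2 for the region's availability zones, keeping only
// those whose state is "available" (so impaired, unavailable, and
// "information" zones are dropped).
func availableZones(region string) ([]string, error) {
	sess, err := session.NewSession(aws.NewConfig().WithRegion(region))
	if err != nil {
		return nil, err
	}
	resp, err := ec2.New(sess).DescribeAvailabilityZones(&ec2.DescribeAvailabilityZonesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("state"),
			Values: []*string{aws.String("available")},
		}},
	})
	if err != nil {
		return nil, err
	}
	zones := make([]string, 0, len(resp.AvailabilityZones))
	for _, zone := range resp.AvailabilityZones {
		zones = append(zones, *zone.ZoneName)
	}
	return zones, nil
}

func main() {
	zones, err := availableZones("us-east-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(zones)
}
```
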
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 8, 2019
@wking (Member, Author) commented Feb 8, 2019

e2e-aws:

Flaky tests:

[Feature:DeploymentConfig] deploymentconfigs with test deployments [Conformance] should run a deployment to completion and then scale to zero [Suite:openshift/conformance/parallel/minimal]

Failing tests:

[Conformance][Area:Networking][Feature:Router] The default ClusterIngress should support default wildcard reencrypt routes through external DNS [Suite:openshift/conformance/parallel/minimal]

/retest

@wking (Member, Author) commented Feb 8, 2019

All green, when you have a moment @abhinavdahiya, @staebler.

@abhinavdahiya (Contributor)

> Moving forward, we'll want to shift this request into a single available-zones asset.

Do we have a PR open or a tracking issue...

@wking (Member, Author) commented Feb 8, 2019

> Moving forward, we'll want to shift this request into a single available-zones asset.
>
> Do we have a PR open or a tracking issue...

#1121 will get us closer. I can file a ticket later tonight for whatever is still left after that.

@wking (Member, Author) commented Feb 8, 2019

But "zone goes offline mid-asset-graph" is probably going to break automatic zone selection even once we're down to a single zone-listing call (we'll just be internally consistent when we die ;).

@wking (Member, Author) commented Feb 8, 2019

Ok, I think #1045, #1121, and #1212 together will address the consolidation to a single available-zones request in Go. I still think we want this PR to get the filtering in Go (which we'll want forever) and Terraform (which we'll need until #1045 and #1121 together let us drop that Terraform zone request).

@staebler (Contributor)

> Ok, I think #1045, #1121, and #1212 together will address the consolidation to a single available-zones request in Go. I still think we want this PR to get the filtering in Go (which we'll want forever) and Terraform (which we'll need until #1045 and #1121 together let us drop that Terraform zone request).

I don't think we are quite there yet for dropping the Terraform zone request even after #1045 and #1121. I would like to see us send from Go to Terraform the union of the zones used in all of the machine pools. Then, we can get rid of the Terraform zone request.
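
For illustration, that union could be computed along these lines (a hedged sketch; the `[][]string` input and `zoneUnion` name are stand-ins, not the installer's actual machine-pool types):

```go
package main

import (
	"fmt"
	"sort"
)

// zoneUnion returns the sorted union of availability zones across machine
// pools, suitable for handing to Terraform as a single list.
func zoneUnion(poolZones [][]string) []string {
	seen := map[string]bool{}
	var union []string
	for _, zones := range poolZones {
		for _, zone := range zones {
			if !seen[zone] {
				seen[zone] = true
				union = append(union, zone)
			}
		}
	}
	sort.Strings(union) // deterministic ordering keeps the Terraform input stable
	return union
}

func main() {
	master := []string{"us-east-1a", "us-east-1b"}
	worker := []string{"us-east-1b", "us-east-1c"}
	fmt.Println(zoneUnion([][]string{master, worker})) // [us-east-1a us-east-1b us-east-1c]
}
```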

@staebler (Contributor)

Do we need to modify the Go fetch of availability zones as well to only return the zones that have a state of "available"?

wking added a commit to wking/openshift-installer that referenced this pull request Feb 28, 2019
…-release:4.0.0-0.6

Clayton pushed 4.0.0-0.nightly-2019-02-27-213933 to
quay.io/openshift-release-dev/ocp-release:4.0.0-0.6.  Extracting the
associated RHCOS build:

  $ oc adm release info --pullspecs quay.io/openshift-release-dev/ocp-release:4.0.0-0.6 | grep machine-os-content
    machine-os-content                            registry.svc.ci.openshift.org/ocp/4.0-art-latest-2019-02-27-213933@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
  $ oc image info registry.svc.ci.openshift.org/ocp/4.0-art-latest-2019-02-26-125216@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b | grep version
              version=47.330

that's the same machine-os-content image referenced from 4.0.0-0.5,
which we used for installer v0.13.0.

Renaming OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE gets us CI testing
of the pinned release despite openshift/release@60007df2 (Use
RELEASE_IMAGE_LATEST for CVO payload, 2018-10-03,
openshift/release#1793).

Also comment out regions which this particular RHCOS build wasn't
pushed to, leaving only:

  $ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/47.330/meta.json | jq -r '.amis[] | .name'
  ap-northeast-1
  ap-northeast-2
  ap-south-1
  ap-southeast-1
  ap-southeast-2
  ca-central-1
  eu-central-1
  eu-west-1
  eu-west-2
  eu-west-3
  sa-east-1
  us-east-1
  us-east-2
  us-west-1
  us-west-2

I'd initially expected to export the pinning environment variables in
release.sh, but I've put them in build.sh here because our continuous
integration tests use build.sh directly and don't go through
release.sh.

Using the slick, new change-log generator from [1], here's everything
that changed in the update payload:

  $ oc adm release info --changelog ~/.local/lib/go/src --changes-from quay.io/openshift-release-dev/ocp-release:4.0.0-0.5 quay.io/openshift-release-dev/ocp-release:4.0.0-0.6
  # 4.0.0-0.6

  Created: 2019-02-28 20:40:11 +0000 UTC
  Image Digest: `sha256:5ce3d05da3bfa3d0310684f5ac53d98d66a904d25f2e55c2442705b628560962`
  Promoted from registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-27-213933

  ## Changes from 4.0.0-0.5

  ### Components

  * Kubernetes 1.12.4

  ### New images

  * [pod](https://github.com/openshift/images) git [2f60da39](openshift/images@2f60da3) `sha256:c0d602467dfe0299ce577ba568a9ef5fb9b0864bac6455604258e7f5986d3509`

  ### Rebuilt images without code change

  * [cloud-credential-operator](https://github.com/openshift/cloud-credential-operator) git [01bbf372](openshift/cloud-credential-operator@01bbf37) `sha256:f87be09923a5cb081722634d2e0c3d0a5633ea2c23da651398d4e915ad9f73b0`
  * [cluster-autoscaler](https://github.com/openshift/kubernetes-autoscaler) git [d8a4a304](openshift/kubernetes-autoscaler@d8a4a30) `sha256:955413b82cf8054ce149bc05c18297a8abe9c59f9d0034989f08086ae6c71fa6`
  * [cluster-autoscaler-operator](https://github.com/openshift/cluster-autoscaler-operator) git [73c46659](openshift/cluster-autoscaler-operator@73c4665) `sha256:756e813fce04841993c8060d08a5684c173cbfb61a090ae67cb1558d76a0336e`
  * [cluster-bootstrap](https://github.com/openshift/cluster-bootstrap) git [05a5c8e6](openshift/cluster-bootstrap@05a5c8e) `sha256:dbdd90da7d256e8d49e4e21cb0bdef618c79d83f539049f89f3e3af5dbc77e0f`
  * [cluster-config-operator](https://github.com/openshift/cluster-config-operator) git [aa1805e7](openshift/cluster-config-operator@aa1805e) `sha256:773d3355e6365237501d4eb70d58cd0633feb541d4b6f23d6a5f7b41fd6ad2f5`
  * [cluster-dns-operator](https://github.com/openshift/cluster-dns-operator) git [ffb04ae9](openshift/cluster-dns-operator@ffb04ae) `sha256:ca15f98cc1f61440f87950773329e1fdf58e73e591638f18c43384ad4f8f84da`
  * [cluster-machine-approver](https://github.com/openshift/cluster-machine-approver) git [2fbc6a6b](openshift/cluster-machine-approver@2fbc6a6) `sha256:a66af3b1f4ae98257ab600d54f8c94f3a4136f85863bbe0fa7c5dba65c5aea46`
  * [cluster-node-tuned](https://github.com/openshift/openshift-tuned) git [278ee72d](openshift/openshift-tuned@278ee72) `sha256:ad71743cc50a6f07eba013b496beab9ec817603b07fd3f5c022fffbf400e4f4b`
  * [cluster-node-tuning-operator](https://github.com/openshift/cluster-node-tuning-operator) git [b5c14deb](openshift/cluster-node-tuning-operator@b5c14de) `sha256:e61d1fdb7ad9f5fed870e917a1bc8fac9ccede6e4426d31678876bcb5896b000`
  * [cluster-openshift-controller-manager-operator](https://github.com/openshift/cluster-openshift-controller-manager-operator) git [3f79b51b](openshift/cluster-openshift-controller-manager-operator@3f79b51) `sha256:8f3b40b4dd29186975c900e41b1a94ce511478eeea653b89a065257a62bf3ae9`
  * [cluster-svcat-apiserver-operator](https://github.com/openshift/cluster-svcat-apiserver-operator) git [547648cb](openshift/cluster-svcat-apiserver-operator@547648c) `sha256:e7c9323b91dbb11e044d5a1277d1e29d106d92627a6c32bd0368616e0bcf631a`
  * [cluster-svcat-controller-manager-operator](https://github.com/openshift/cluster-svcat-controller-manager-operator) git [9261f420](openshift/cluster-svcat-controller-manager-operator@9261f42) `sha256:097a429eda2306fcd49e14e4f5db8ec3a09a90fa29ebdbc98cc519511ab6fb5b`
  * [cluster-version-operator](https://github.com/openshift/cluster-version-operator) git [70c0232e](openshift/cluster-version-operator@70c0232) `sha256:7d59edff68300e13f0b9e56d2f2bc1af7f0051a9fbc76cc208239137ac10f782`
  * [configmap-reloader](https://github.com/openshift/configmap-reload) git [3c2f8572](openshift/configmap-reload@3c2f857) `sha256:32360c79d8d8d54cea03675c24f9d0a69877a2f2e16b949ca1d97440b8f45220`
  * [console-operator](https://github.com/openshift/console-operator) git [32ed7c03](openshift/console-operator@32ed7c0) `sha256:f8c07cb72dc8aa931bbfabca9b4133f3b93bc96da59e95110ceb8c64f3efc755`
  * [container-networking-plugins-supported](https://github.com/openshift/ose-containernetworking-plugins) git [f6a58dce](openshift/ose-containernetworking-plugins@f6a58dc) `sha256:c6434441fa9cc96428385574578c41e9bc833b6db9557df1dd627411d9372bf4`
  * [container-networking-plugins-unsupported](https://github.com/openshift/ose-containernetworking-plugins) git [f6a58dce](openshift/ose-containernetworking-plugins@f6a58dc) `sha256:bb589cf71d4f41977ec329cf808cdb956d5eedfc604e36b98cfd0bacce513ffc`
  * [coredns](https://github.com/openshift/coredns) git [fbcb8252](openshift/coredns@fbcb825) `sha256:2f1812a95e153a40ce607de9b3ace7cae5bee67467a44a64672dac54e47f2a66`
  * [docker-builder](https://github.com/openshift/builder) git [1a77d837](openshift/builder@1a77d83) `sha256:27062ab2c62869e5ffeca234e97863334633241089a5d822a19350f16945fbcb`
  * [etcd](https://github.com/openshift/etcd) git [a0e62b48](openshift/etcd@a0e62b4) `sha256:e4e9677d004f8f93d4f084739b4502c2957c6620d633e1fdb379c33243c684fa`
  * [grafana](https://github.com/openshift/grafana) git [58efe0eb](openshift/grafana@58efe0e) `sha256:548abcc50ccb8bb17e6be2baf050062a60fc5ea0ca5d6c59ebcb8286fc9eb043`
  * [haproxy-router](https://github.com/openshift/router) git [2c33f47f](openshift/router@2c33f47) `sha256:c899b557e4ee2ea7fdbe5c37b5f4f6e9f9748a39119130fa930d9497464bd957`
  * [k8s-prometheus-adapter](https://github.com/openshift/k8s-prometheus-adapter) git [815fa76b](openshift/k8s-prometheus-adapter@815fa76) `sha256:772c1b40b21ccaa9ffcb5556a1228578526a141b230e8ac0afe19f14404fdffc`
  * [kube-rbac-proxy](https://github.com/openshift/kube-rbac-proxy) git [3f271e09](openshift/kube-rbac-proxy@3f271e0) `sha256:b6de05167ecab0472279cdc430105fac4b97fb2c43d854e1c1aa470d20a36572`
  * [kube-state-metrics](https://github.com/openshift/kube-state-metrics) git [2ab51c9f](openshift/kube-state-metrics@2ab51c9) `sha256:611c800c052de692c84d89da504d9f386d3dcab59cbbcaf6a26023756bc863a0`
  * [libvirt-machine-controllers](https://github.com/openshift/cluster-api-provider-libvirt) git [7ff8b08f](openshift/cluster-api-provider-libvirt@7ff8b08) `sha256:6ab8749886ec26d45853c0e7ade3c1faaf6b36e09ba2b8a55f66c6cc25052832`
  * [multus-cni](https://github.com/openshift/ose-multus-cni) git [61f9e088](https://github.com/openshift/ose-multus-cni/commit/61f9e0886370ea5f6093ed61d4cfefc6dadef582) `sha256:e3f87811d22751e7f06863e7a1407652af781e32e614c8535f63d744e923ea5c`
  * [oauth-proxy](https://github.com/openshift/oauth-proxy) git [b771960b](openshift/oauth-proxy@b771960) `sha256:093a2ac687849e91671ce906054685a4c193dfbed27ebb977302f2e09ad856dc`
  * [openstack-machine-controllers](https://github.com/openshift/cluster-api-provider-openstack) git [c2d845b](openshift/cluster-api-provider-openstack@c2d845b) `sha256:f9c321de068d977d5b4adf8f697c5b15f870ccf24ad3e19989b129e744a352a7`
  * [operator-registry](https://github.com/operator-framework/operator-registry) git [0531400c](operator-framework/operator-registry@0531400) `sha256:730f3b504cccf07e72282caf60dc12f4e7655d7aacf0374d710c3f27125f7008`
  * [prom-label-proxy](https://github.com/openshift/prom-label-proxy) git [46423f9d](openshift/prom-label-proxy@46423f9) `sha256:3235ad5e22b6f560d447266e0ecb2e5655fda7c0ab5c1021d8d3a4202f04d2ca`
  * [prometheus](https://github.com/openshift/prometheus) git [6e5fb5dc](openshift/prometheus@6e5fb5d) `sha256:013455905e4a6313f8c471ba5f99962ec097a9cecee3e22bdff3e87061efad57`
  * [prometheus-alertmanager](https://github.com/openshift/prometheus-alertmanager) git [4617d550](openshift/prometheus-alertmanager@4617d55) `sha256:54512a6cf25cf3baf7fed0b01a1d4786d952d93f662578398cad0d06c9e4e951`
  * [prometheus-config-reloader](https://github.com/openshift/prometheus-operator) git [f8a0aa17](openshift/prometheus-operator@f8a0aa1) `sha256:244fc5f1a4a0aa983067331c762a04a6939407b4396ae0e86a1dd1519e42bb5d`
  * [prometheus-node-exporter](https://github.com/openshift/node_exporter) git [f248b582](openshift/node_exporter@f248b58) `sha256:390e5e1b3f3c401a0fea307d6f9295c7ff7d23b4b27fa0eb8f4017bd86d7252c`
  * [prometheus-operator](https://github.com/openshift/prometheus-operator) git [f8a0aa17](openshift/prometheus-operator@f8a0aa1) `sha256:6e697dcaa19e03bded1edf5770fb19c0d2cd8739885e79723e898824ce3cd8f5`
  * [service-catalog](https://github.com/openshift/service-catalog) git [b24ffd6f](openshift/service-catalog@b24ffd6) `sha256:85ea2924810ced0a66d414adb63445a90d61ab5318808859790b1d4b7decfea6`
  * [service-serving-cert-signer](https://github.com/openshift/service-serving-cert-signer) git [30924216](openshift/service-serving-cert-signer@3092421) `sha256:7f89db559ffbd3bf609489e228f959a032d68dd78ae083be72c9048ef0c35064`
  * [telemeter](https://github.com/openshift/telemeter) git [e12aabe4](openshift/telemeter@e12aabe) `sha256:fd518d2c056d4ab8a89d80888e0a96445be41f747bfc5f93aa51c7177cf92b92`

  ### [aws-machine-controllers](https://github.com/openshift/cluster-api-provider-aws)

  * client: add cluster-api-provider-aws to UserAgent for AWS API calls [openshift#167](openshift/cluster-api-provider-aws#167)
  * Drop the yaml unmarshalling [openshift#155](openshift/cluster-api-provider-aws#155)
  * [Full changelog](openshift/cluster-api-provider-aws@46f4852...c0c3b9e)

  ### [cli, deployer, hyperkube, hypershift, node, tests](https://github.com/openshift/ose)

  * Build OSTree using baked SELinux policy [#22081](https://github.com/openshift/ose/pull/22081)
  * NodeName was being cleared for `oc debug node/X` instead of set [#22086](https://github.com/openshift/ose/pull/22086)
  * UPSTREAM: 73894: Print the involved object in the event table [#22039](https://github.com/openshift/ose/pull/22039)
  * Publish CRD openapi [#22045](https://github.com/openshift/ose/pull/22045)
  * UPSTREAM: 00000: wait for CRD discovery to be successful once before [#22149](https://github.com/openshift/ose/pull/22149)
  * `oc adm release info --changelog` should clone if necessary [#22148](https://github.com/openshift/ose/pull/22148)
  * [Full changelog](openshift/ose@c547bc3...0cbcfc5)

  ### [cluster-authentication-operator](https://github.com/openshift/cluster-authentication-operator)

  * Add redeploy on serving cert and operator pod template change [openshift#75](openshift/cluster-authentication-operator#75)
  * Create the service before waiting for serving certs [openshift#84](openshift/cluster-authentication-operator#84)
  * [Full changelog](openshift/cluster-authentication-operator@78dd53b...35879ec)

  ### [cluster-image-registry-operator](https://github.com/openshift/cluster-image-registry-operator)

  * Enable subresource status [openshift#209](openshift/cluster-image-registry-operator#209)
  * Add ReadOnly flag [openshift#210](openshift/cluster-image-registry-operator#210)
  * do not setup ownerrefs for clusterscoped/cross-namespace objects [openshift#215](openshift/cluster-image-registry-operator#215)
  * s3: include operator version in UserAgent for AWS API calls [openshift#212](openshift/cluster-image-registry-operator#212)
  * [Full changelog](openshift/cluster-image-registry-operator@0780074...8060048)

  ### [cluster-ingress-operator](https://github.com/openshift/cluster-ingress-operator)

  * Adds info log msg indicating ns/secret used by DNSManager [openshift#134](openshift/cluster-ingress-operator#134)
  * Introduce certificate controller [openshift#140](openshift/cluster-ingress-operator#140)
  * [Full changelog](openshift/cluster-ingress-operator@1b4fa5a...09d14db)

  ### [cluster-kube-apiserver-operator](https://github.com/openshift/cluster-kube-apiserver-operator)

  * bump(*): fix installer pod shutdown and rolebinding [openshift#307](openshift/cluster-kube-apiserver-operator#307)
  * bump to fix early status [openshift#309](openshift/cluster-kube-apiserver-operator#309)
  * [Full changelog](openshift/cluster-kube-apiserver-operator@4016927...fa75c05)

  ### [cluster-kube-controller-manager-operator](https://github.com/openshift/cluster-kube-controller-manager-operator)

  * bump(*): fix installer pod shutdown and rolebinding [openshift#183](openshift/cluster-kube-controller-manager-operator#183)
  * bump to fix empty status [openshift#184](openshift/cluster-kube-controller-manager-operator#184)
  * [Full changelog](openshift/cluster-kube-controller-manager-operator@95f5f32...53ff6d8)

  ### [cluster-kube-scheduler-operator](https://github.com/openshift/cluster-kube-scheduler-operator)

  * Rotate kubeconfig [openshift#62](openshift/cluster-kube-scheduler-operator#62)
  * Don't pass nil function pointer to NewConfigObserver [openshift#65](openshift/cluster-kube-scheduler-operator#65)
  * [Full changelog](openshift/cluster-kube-scheduler-operator@50848b4...7066c96)

  ### [cluster-monitoring-operator](https://github.com/openshift/cluster-monitoring-operator)

  * *: Clean test invocation and documenation [openshift#267](openshift/cluster-monitoring-operator#267)
  * pkg/operator: fix progressing state of cluster operator [openshift#268](openshift/cluster-monitoring-operator#268)
  * jsonnet/main.jsonnet: Bump Prometheus to v2.7.1 [openshift#246](openshift/cluster-monitoring-operator#246)
  * OWNERS: Remove ironcladlou [openshift#204](openshift/cluster-monitoring-operator#204)
  * test/e2e: Refactor framework setup & wait for query logic [openshift#265](openshift/cluster-monitoring-operator#265)
  * jsonnet: Update dependencies [openshift#269](openshift/cluster-monitoring-operator#269)
  * [Full changelog](openshift/cluster-monitoring-operator@94b701f...3609aea)

  ### [cluster-network-operator](https://github.com/openshift/cluster-network-operator)

  * Update to be able to track both DaemonSets and Deployments [openshift#102](openshift/cluster-network-operator#102)
  * openshift-sdn: more service-catalog netnamespace fixes [openshift#108](openshift/cluster-network-operator#108)
  * [Full changelog](openshift/cluster-network-operator@9db4d03...15204e6)

  ### [cluster-openshift-apiserver-operator](https://github.com/openshift/cluster-openshift-apiserver-operator)

  * bump to fix status reporting [openshift#157](openshift/cluster-openshift-apiserver-operator#157)
  * [Full changelog](openshift/cluster-openshift-apiserver-operator@1ce6ac7...0a65fe4)

  ### [cluster-samples-operator](https://github.com/openshift/cluster-samples-operator)

  * use pumped up rate limiter, shave 30 seconds from startup creates [openshift#113](openshift/cluster-samples-operator#113)
  * [Full changelog](openshift/cluster-samples-operator@4726068...f001324)

  ### [cluster-storage-operator](https://github.com/openshift/cluster-storage-operator)

  * WaitForFirstConsumer in AWS StorageClass [openshift#12](openshift/cluster-storage-operator#12)
  * [Full changelog](openshift/cluster-storage-operator@dc42489...b850242)

  ### [console](https://github.com/openshift/console)

  * Add back OAuth configuration link in kubeadmin notifier [openshift#1202](openshift/console#1202)
  * Normalize display of <ResourceIcon> across browsers, platforms [openshift#1210](openshift/console#1210)
  * Add margin spacing so event info doesn't run together before truncating [openshift#1170](openshift/console#1170)
  * [Full changelog](openshift/console@a0b75bc...d10fb8b)

  ### [docker-registry](https://github.com/openshift/image-registry)

  * Bump k8s and OpenShift, use new docker-distribution branch [openshift#165](openshift/image-registry#165)
  * [Full changelog](openshift/image-registry@75a1fbe...afcc7da)

  ### [installer](https://github.com/openshift/installer)

  * data: route53 A records with SimplePolicy should not use health check [openshift#1308](openshift#1308)
  * bootkube.sh: do not hide problems with render [openshift#1274](openshift#1274)
  * data/bootstrap/files/usr/local/bin/bootkube: etcdctl from release image [openshift#1315](openshift#1315)
  * pkg/types/validation: Drop v1beta1 backwards compat hack [openshift#1251](openshift#1251)
  * pkg/asset/tls: self-sign etcd-client-ca [openshift#1267](openshift#1267)
  * pkg/asset/tls: self-sign aggregator-ca [openshift#1275](openshift#1275)
  * pkg/types/validation/installconfig: Drop nominal v1beta2 support [openshift#1319](openshift#1319)
  * Removing unused/deprecated security groups and ports. Updated AWS doc [openshift#1306](openshift#1306)
  * [Full changelog](openshift/installer@0208204...563f71f)

  ### [jenkins, jenkins-agent-maven, jenkins-agent-nodejs](https://github.com/openshift/jenkins)

  * recover from jenkins deps backleveling workflow-durable-task-step fro… [openshift#806](openshift/jenkins#806)
  * [Full changelog](openshift/jenkins@2485f9a...e4583ca)

  ### [machine-api-operator](https://github.com/openshift/machine-api-operator)

  * Rename labels from sigs.k8s.io to machine.openshift.io [openshift#213](openshift/machine-api-operator#213)
  * Remove clusters.cluster.k8s.io CRD [openshift#225](openshift/machine-api-operator#225)
  * MAO: Stop setting statusProgressing=true when resyincing same version [openshift#217](openshift/machine-api-operator#217)
  * Generate clientset for machine health check API [openshift#223](openshift/machine-api-operator#223)
  * [Full changelog](openshift/machine-api-operator@bf95d7d...34c3424)

  ### [machine-config-controller, machine-config-daemon, machine-config-operator, machine-config-server, setup-etcd-environment](https://github.com/openshift/machine-config-operator)

  * daemon: Only print status if os == RHCOS [openshift#495](openshift/machine-config-operator#495)
  * Add pod image to image-references [openshift#500](openshift/machine-config-operator#500)
  * pkg/daemon: stash the node object [openshift#464](openshift/machine-config-operator#464)
  * Eliminate use of cpu limits [openshift#503](openshift/machine-config-operator#503)
  * MCD: add ign validation check for mc.ignconfig [openshift#481](openshift/machine-config-operator#481)
  * [Full changelog](openshift/machine-config-operator@875f25e...f0b87fc)

  ### [operator-lifecycle-manager](https://github.com/operator-framework/operator-lifecycle-manager)

  * fix(owners): remove cross-namespace and cluster->namespace ownerrefs [openshift#729](operator-framework/operator-lifecycle-manager#729)
  * [Full changelog](operator-framework/operator-lifecycle-manager@1ac9ace...9186781)

  ### [operator-marketplace](https://github.com/operator-framework/operator-marketplace)

  * [opsrc] Do not delete csc during purge [openshift#117](operator-framework/operator-marketplace#117)
  * Remove Dependency on Owner References [openshift#118](operator-framework/operator-marketplace#118)
  * [Full changelog](operator-framework/operator-marketplace@7b53305...fedd694)

[1]: openshift/origin#22030
@wking (Member, Author) commented Mar 5, 2019

> Do we need to modify the Go fetch of availability zones as well to only return the zones that have a state of "available"?

Like this?

@staebler (Contributor) commented Mar 5, 2019

> Do we need to modify the Go fetch of availability zones as well to only return the zones that have a state of "available"?
>
> Like this?

Yeah, just like that. It boggles my mind that I made that comment. I can't understand how I could have missed that, particularly given that it is the majority of the changes in the PR.

@@ -3,7 +3,9 @@
 data "aws_region" "current" {}
 
 // Fetch a list of available AZs
-data "aws_availability_zones" "azs" {}
+data "aws_availability_zones" "azs" {
Contributor

Maybe I'm wrong, but with #1121 the available-zone check in Go will make sure that the master AZs chosen were available, and the az_to_<public/private>subnet maps will make sure we only create machines in AZs chosen by the installer?

Member Author

> Maybe I'm wrong, but with #1121 the available-zone check in Go...

Yup, the Terraform part of this is code that is hopefully going away soon. But we want the Go part of this long-term, and the Terraform code is still there now, so I think we might as well fix it too.

Contributor

Well, technically Terraform should use the AZs given by the installer, i.e. if the AZ goes out of service between asset generation and the Terraform call, it would be better to fail with an error that the AZ is out of service rather than that the AZ was not found in the az_to_<public/private>subnet maps.

Member Author

> Well, technically Terraform should use the AZs given by the installer...

Agreed.

> ... i.e. if the AZ goes out of service between asset generation and the Terraform call...

I think using Go-provided zones means we can drop this data "aws_availability_zones" section altogether. Guarding against a zone going down is always going to be racy up through actually creating resources in the zone, so I'd rather save a few lines and the mental overhead and just have Terraform assume Go has already done any required validation and blindly consume what it was given.
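
As a sketch of what that hand-off might look like on the Go side (hedged; the `aws_availability_zones` tfvars key is hypothetical, not the installer's real variable name), the already-validated zone list would simply be serialized into the tfvars JSON that Terraform consumes:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// zonesTFVars serializes a Go-validated zone list into a tfvars-style JSON
// document so Terraform can consume it without re-querying AWS.
// The "aws_availability_zones" key is illustrative only.
func zonesTFVars(zones []string) ([]byte, error) {
	return json.MarshalIndent(map[string]interface{}{
		"aws_availability_zones": zones,
	}, "", "  ")
}

func main() {
	out, err := zonesTFVars([]string{"us-east-1a", "us-east-1b"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```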

Contributor

> I think using Go-provided zones means we can drop this data "aws_availability_zones" section altogether. Guarding against a zone going down is always going to be racy up through actually creating resources in the zone, so I'd rather save a few lines and the mental overhead and just have Terraform assume Go has already done any required validation and blindly consume what it was given.

I can get behind that change. :)

Member Author

> I can get behind that change. :)

Did you want me to pick that up in this PR? Or submit it as a follow-up once this lands?

Contributor

> I think using Go-provided zones means we can drop this data "aws_availability_zones" section altogether. Guarding against a zone going down is always going to be racy up through actually creating resources in the zone, so I'd rather save a few lines and the mental overhead and just have Terraform assume Go has already done any required validation and blindly consume what it was given.
>
> I can get behind that change. :)

But that would need some changes, especially https://jira.coreos.com/browse/CORS-902, which would push this PR to 4.1.

Contributor

> I can get behind that change. :)
>
> Did you want me to pick that up in this PR? Or submit it as a follow-up once this lands?

Follow-up is fine.

@abhinavdahiya (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d8e7559 into openshift:master Mar 6, 2019
@wking wking deleted the aws-available-zones branch March 6, 2019 02:06
wking added a commit to wking/openshift-release that referenced this pull request Apr 8, 2021
This was originally part of avoiding broken zones, see e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204) and b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833).  But the installer has
had broken-zone avoidance since way back in
openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return
available zones, 2019-02-07, openshift/installer#1210).  I dunno how
reliably AWS sets 'state: impaired' and similar; it didn't seem to
protect us from e8921c3.  But we're getting ready to pivot to using
multiple AWS accounts, which creates two issues with hard-coding
region names in the step:

1. References by name are not stable between accounts.  From the AWS
   docs [1]:

     To ensure that resources are distributed across the Availability
     Zones for a Region, we independently map Availability Zones to
     names for each AWS account. For example, the Availability Zone
     us-east-1a for your AWS account might not be the same location as
     us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might
   apply to one account but not the other.  And the installer does not
   currently accept zone IDs.

2. References by name may not exist in other accounts.  From the AWS
   docs [1]:

     As Availability Zones grow over time, our ability to expand them
     can become constrained. If this happens, we might restrict you
     from launching an instance in a constrained Availability Zone
     unless you already have an instance in that Availability
     Zone. Eventually, we might also remove the constrained
     Availability Zone from the list of Availability Zones for new
     accounts. Therefore, your account might have a different number
     of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't name
   sequentially, e.g. our new account lacks us-west-1a:

     $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1a usw1-az3 available
     us-west-1b usw1-az1 available
     $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1b usw1-az3 available
     us-west-1c usw1-az1 available

   I have no idea why they decided to do that, but we have to work
   with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although
we are still vulnerable to (2) with the a/b default line.  We'll do
something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it
to try to set up each available zone, and that causes more API
contention and resource consumption than we want.  Background on that
in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS
availability zones, 2019-03-28, openshift#3285) and d87fffb
(ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615),
as well as the rejected/rotted-out [2].

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487