data/aws: use nlbs instead of elbs #594

Merged
merged 4 commits from crawford/nlb into openshift:master
Nov 2, 2018

Conversation

crawford
Contributor

@crawford crawford commented Nov 1, 2018

We've noticed an elevated rate of installation failures recently. The
root cause appears to be 50-90 seconds of latency added to traffic going
through the internal ELB on port 49500. This was causing Ignition's
connections to time out, resulting in the machines never provisioning.

AWS's NLBs don't seem to have this high latency, so we've decided to
move over to them instead. With the move to NLBs, we also get the
ability to add individual health checks for each port instead of just a
single health check for each load balancer. Also, NLBs are cheaper.

This commit drops support for ingress and the console. Since the console
and router aren't currently configured correctly, nobody should notice
that this is gone. It was easier to drop support in this commit rather
than continue to try to plumb through the existing implementation
knowing that it was going to have to change in the future. Once the
router has a strategy for ingress, we'll re-add this functionality using
the new NLBs.

This also drops support for the <cluster-name>-k8s DNS entry. We aren't
aware of any consumers and it was going to be tedious to keep this
working.
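For illustration, here is a minimal Terraform sketch of the NLB shape this PR moves to, with a dedicated target group (and thus its own health check) per forwarded port. The resource names, ports, and IDs are placeholders, not the installer's actual configuration:

  resource "aws_lb" "api_internal" {
    name               = "example-api-int"
    load_balancer_type = "network"
    internal           = true
    subnets            = ["subnet-aaaa", "subnet-bbbb"]  # placeholder subnet IDs
  }

  # One target group and health check just for the Ignition port.
  resource "aws_lb_target_group" "ignition" {
    name     = "example-ignition"
    port     = 49500
    protocol = "TCP"
    vpc_id   = "vpc-cccc"  # placeholder VPC ID

    health_check {
      protocol = "TCP"
      port     = 49500
    }
  }

  resource "aws_lb_listener" "ignition" {
    load_balancer_arn = "${aws_lb.api_internal.arn}"
    port              = 49500
    protocol          = "TCP"

    default_action {
      type             = "forward"
      target_group_arn = "${aws_lb_target_group.ignition.arn}"
    }
  }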

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 1, 2018
These variables and outputs aren't used anywhere and can be removed.
The existing version would sometimes panic when creating NLBs and was
also missing a few options specific to NLBs. There are probably many
other fixes and improvements that come with the newer provider.
This gives the installer the ability to tear down NLBs.
We've noticed an elevated rate of installation failures recently. The
root cause appears to be 50-90 seconds of latency added to traffic going
through the internal ELB on port 49500. This was causing Ignition's
connections to time out, resulting in the machines never provisioning.

AWS's NLBs don't seem to have this high latency, so we've decided to
move over to them instead. With the move to NLBs, we also get the
ability to add individual health checks for each port instead of just a
single health check for each load balancer. Also, NLBs are cheaper.

This commit drops support for ingress and the console. Since the console
and router aren't currently configured correctly, nobody should notice
that this is gone. It was easier to drop support in this commit rather
than continue to try to plumb through the existing implementation
knowing that it was going to have to change in the future. Once the
router has a strategy for ingress, we'll re-add this functionality using
the new NLBs.

This also drops support for the `<cluster-name>-k8s` DNS entry. We aren't
aware of any consumers and it was going to be tedious to keep this
working.
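The provider bump described in the second commit above amounts to pinning a newer release of the Terraform AWS provider. A hedged sketch (1.39.0 is the version a later commit on this page says the installer pins; the exact constraint syntax in the repo may differ):

  provider "aws" {
    version = "1.39.0"
    region  = "us-east-1"  # placeholder region
  }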
@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 2, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@crawford
Contributor Author

crawford commented Nov 2, 2018

/hold

Waiting for limit bump on NLBs from AWS.

Also, @ironcladlou, does the removal of the ingress LBs block your work? We can add an ALB for router traffic if needed.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2018
@ironcladlou
Contributor

@crawford

Also, @ironcladlou, does the removal of the ingress LBs block your work? We can add an ALB for router traffic if needed.

This change doesn't impact ingress today; routers are exposed by ELBs provisioned by the k8s service controller, and we haven't got to external app router DNS setup yet. Thanks for the heads up!

@crawford
Contributor Author

crawford commented Nov 2, 2018

Still waiting on AWS, but the current limit is 20 NLBs, which gets us 10 concurrent clusters. I think that's good enough for now.

@ironcladlou Okay, I'm going to go ahead and cancel the hold.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2018
@openshift-merge-robot openshift-merge-robot merged commit 1e2b796 into openshift:master Nov 2, 2018
@crawford crawford deleted the nlb branch November 2, 2018 16:36
wking added a commit to wking/openshift-installer that referenced this pull request Nov 2, 2018
Through 1e2b796 (Merge pull request openshift#594 from crawford/nlb,
2018-11-02).
wking added a commit to wking/openshift-installer that referenced this pull request Nov 4, 2018
The listeners may be private or public, but as far as the bootstrap
and master modules are concerned, these are just lists of ARNs that
need attaching.  This partially unwinds changes from 16dfbb3
(data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), before which
the bootstrap module used a unified list of load balancers while the
master module made private/public distinctions.
wking added a commit to wking/openshift-installer that referenced this pull request Nov 4, 2018
The listeners may be private or public, but as far as the bootstrap
and master modules are concerned, these are just lists of ARNs that
need attaching.  This partially unwinds changes from 16dfbb3
(data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), before which
the bootstrap module used a unified list of load balancers while the
master module made private/public distinctions.

While I'm touching variables.tf, I've alphabetized the dns_server_ip
and kubeconfig_content entries as well.
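Illustratively, the unified interface described above might look like the following sketch; the variable and resource names are hypothetical, and the bootstrap instance is assumed to be defined elsewhere in the module:

  variable "target_group_arns" {
    type        = "list"
    description = "NLB target group ARNs to attach to, private or public alike"
    default     = []
  }

  # Attach the instance to every target group it was handed, without
  # distinguishing private from public listeners.
  resource "aws_lb_target_group_attachment" "bootstrap" {
    count            = "${length(var.target_group_arns)}"
    target_group_arn = "${var.target_group_arns[count.index]}"
    target_id        = "${aws_instance.bootstrap.id}"
  }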
wking added a commit to wking/openshift-installer that referenced this pull request Nov 5, 2018
The listeners may be private or public, but as far as the bootstrap
and master modules are concerned, these are just lists of ARNs that
need attaching.  This partially unwinds changes from 16dfbb3
(data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), before which
the bootstrap module used a unified list of load balancers while the
master module made private/public distinctions.

While I'm touching variables.tf, I've alphabetized the dns_server_ip
and kubeconfig_content entries as well.
wking added a commit to wking/openshift-installer that referenced this pull request Nov 15, 2018
It's the default.  The old parameter was deprecated before
hashicorp/terraform-provider-aws@1cc81c92
(docs/data-source/aws_region: Remove now deprecated current argument,
2018-02-09, v1.9.0), and we're pinning 1.39.0 since ac5aeed
(data/aws: bump aws provider, 2018-11-01, openshift#594).  Changing this
avoids:

  $ openshift-install --log-level=debug create cluster
  ...
  - Downloading plugin for provider "aws" (1.39.0)...
  ...
  Warning: module.vpc.data.aws_region.current: "current": [DEPRECATED] Defaults to current provider region if no other filtering is enabled
  ...
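A sketch of the change this commit describes, dropping the deprecated argument from the data source:

  # Before: the deprecated argument that triggers the warning.
  # data "aws_region" "current" {
  #   current = true
  # }

  # After: no arguments needed; the data source defaults to the provider's region.
  data "aws_region" "current" {}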
wking added a commit to wking/openshift-installer that referenced this pull request Dec 16, 2018
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server.  For
now, I'm assuming that both 200 and 401 responses to /healthz requests
indicate a functional server, and we can evaluate other response
status codes as necessary.  Checking against a recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

So I've allowed 200, 401, and 404 there just to be generous.

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
wking added a commit to wking/openshift-installer that referenced this pull request Dec 16, 2018
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server.  For
now, I'm assuming that both 200 and 401 responses to /healthz requests
indicate a functional server, and we can evaluate other response
status codes as necessary.  Checking against a recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

Unfortunately, setting matcher [9] is not allowed for network load
balancers (e.g. see [10,11]).  Setting it leads to errors like:

  ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
  ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
  ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start
happening.

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)
wking added a commit to wking/openshift-installer that referenced this pull request Jan 9, 2019
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server that
suggests 401s are possible in some configurations.  Checking against a
recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

  ok

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

So we're checking the new /healthz from
openshift/machine-config-operator@d0a7ae21 (server: Add /healthz,
2019-01-04, openshift/machine-config-operator#267) instead.

Unfortunately, setting matcher [9] is not allowed for network load
balancers (e.g. see [10,11]).  Setting it leads to errors like:

  ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
  ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
  ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start
happening.

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)
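Putting the pieces above together, a hedged Terraform sketch of the resulting target group (names and IDs are illustrative; the actual module may differ):

  resource "aws_lb_target_group" "api_internal" {
    name     = "example-api-int"
    port     = 6443
    protocol = "TCP"
    vpc_id   = "vpc-cccc"  # placeholder VPC ID

    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "HTTPS"
      path                = "/healthz"
      # No "matcher" here: custom matchers are rejected for TCP target groups.
    }
  }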
wking added a commit to wking/openshift-installer that referenced this pull request Apr 16, 2019
These were changed from two to three in 16dfbb3 (data/aws: use nlbs
instead of elbs, 2018-11-01, openshift#594):

  $ git show 16dfbb3 | grep _threshold | sort | uniq
  -    healthy_threshold   = 2
  -  #   healthy_threshold   = 2
  +    healthy_threshold   = 3
  -    unhealthy_threshold = 2
  -  #   unhealthy_threshold = 2
  +    unhealthy_threshold = 3

Alex doesn't remember intentionally making that change, and the lower
thresholds will hopefully help mitigate some issues with continued
connection attempts to API servers which are in the process of
shutting down.
wking added a commit to wking/openshift-installer that referenced this pull request Aug 28, 2019
It is not valid for network load balancers [1].  We've been setting
this in our Terraform config since 0a96415 (modules/aws: configure
ELB idle timeout, coreos/tectonic-installer#725) when we were using
classic load balancers and should have dropped it in 16dfbb3
(data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594).

[1]: hashicorp/terraform-provider-aws@d25a227#diff-f4b0dbdc7e3eede6ba70cd286c834f37R78
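A sketch of what that amounts to (illustrative names; the point is only that the NLB definition carries no idle_timeout):

  resource "aws_lb" "api_internal" {
    name               = "example-api-int"
    load_balancer_type = "network"
    internal           = true
    subnets            = ["subnet-aaaa"]  # placeholder subnet ID
    # idle_timeout omitted: it only applies to application load balancers.
  }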
alaypatel07 pushed a commit to alaypatel07/installer that referenced this pull request Nov 13, 2019
It is not valid for network load balancers [1].  We've been setting
this in our Terraform config since 0a96415 (modules/aws: configure
ELB idle timeout, coreos/tectonic-installer#725) when we were using
classic load balancers and should have dropped it in 16dfbb3
(data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594).

[1]: hashicorp/terraform-provider-aws@d25a227#diff-f4b0dbdc7e3eede6ba70cd286c834f37R78
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
It is not valid for network load balancers [1].  We've been setting
this in our Terraform config since 0a96415 (modules/aws: configure
ELB idle timeout, coreos/tectonic-installer#725) when we were using
classic load balancers and should have dropped it in 16dfbb3
(data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594).

[1]: hashicorp/terraform-provider-aws@d25a227#diff-f4b0dbdc7e3eede6ba70cd286c834f37R78