
pkg/asset/machines/worker: Default to MachineSets only where we need them #1487

Closed
wants to merge 1 commit into master from fewer-default-MachineSets

Conversation

@wking (Member) commented Mar 28, 2019

Builds on #1481; I'll rebase after that lands.

And not for additional zones. This will keep us from consuming more NAT gateways and associated EIPs than we need, once subsequent work ties those resources to Machine(Set) consumption. This means that pools where replicas is unset or zero will receive no MachineSets (unless the user explicitly configured zones for the pool), but creating MachineSets later on is something we want to be easy anyway, so I don't see a need to require the installer to inject a template MachineSet into the cluster.
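
For context on "creating MachineSets later on": a day-2 MachineSet for an extra zone is mostly a copy-and-edit of one the installer already created. A minimal sketch, with illustrative resource names that are not from this PR:

    # List the MachineSets the installer created (one per zone it defaulted to).
    oc get machinesets -n openshift-machine-api

    # Copy an existing worker MachineSet, then edit the name, the availability
    # zone, and the subnet reference before applying the new manifest.
    oc get machineset INFRA_ID-worker-us-east-1a -n openshift-machine-api -o yaml > worker-us-east-1e.yaml
    oc apply -f worker-us-east-1e.yaml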

The "In each private subnet" -> "For each private subnet" change is because the NAT gateways currently live in the public subnet, but their purpose is to handle egress from machines in the private subnets.

Spun off from #1481, where this change was contentious. The difference is mostly for folks who run openshift-install create cluster without providing an install-config.yaml. With this approach, that would work in us-east-1, although the users will have to add additional zone infra later if they decide they actually do want something in, say, us-east-1e. With the approach taken in #1481, those users will need to get their EIP limits bumped (or change to using an install-config.yaml) before the install will go through, but they won't have to worry about adding more zone infra on day-2. Personally, I'd rather prioritize out-of-the-box success, and leave it to folks who for some reason need many zones to explicitly configure them (or set them up as day-2 operations).
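
For folks who do want extra zones at install time, the pool's zones can also be set explicitly in an install-config.yaml before running the installer. A rough sketch of that flow; the field path and the extra zone name here are my reading of the install-config schema rather than anything from this PR:

    # Generate an install-config.yaml, pin the worker pool's zones, then install.
    openshift-install create install-config --dir mycluster
    # edit mycluster/install-config.yaml, e.g.:
    #   compute[0].platform.aws.zones: [us-east-1a, us-east-1b, us-east-1e]
    openshift-install create cluster --dir mycluster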

@openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Mar 28, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Mar 28, 2019
@wking force-pushed the fewer-default-MachineSets branch from d21aa5c to f71ba29 on March 28, 2019 at 16:55
@wking force-pushed the fewer-default-MachineSets branch from f71ba29 to a9c7453 on March 28, 2019 at 19:11
@openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) and removed the size/L label on Mar 28, 2019
@wking (Member, Author) commented Mar 28, 2019

Rebased onto master with f71ba29 -> a9c7453, now that #1481 has landed.

@wking (Member, Author) commented Mar 28, 2019

The images job hit a test-platform flake:

release "release-latest" failed: pod release-latest was already deleted

/retest

@smarterclayton (Contributor)

What risk exists in this?

@abhinavdahiya (Contributor)

What risk exists in this?

/hold

This restricts us to deploying/setting up AZs based on the machine count chosen at install time, even though users didn't know they were making that choice, and there is no way to expand the subnets to new AZs on day 2. That's why I'm against it.

Users who specifically choose fewer AZs already deploy only to those AZs, which I think is the better solution, and it's one we already have in place.

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 1, 2019
@openshift-ci-robot (Contributor)

@wking: The following tests failed, say /retest to rerun them all:

Test name                Commit   Details  Rerun command
ci/prow/e2e-aws-upgrade  a9c7453  link     /test e2e-aws-upgrade
ci/prow/e2e-aws          a9c7453  link     /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya (Contributor)

closing due to inactivity. Please reopen if needed.

/close

@openshift-ci-robot (Contributor)

@abhinavdahiya: Closed this PR.

In response to this:

closing due to inactivity. Please reopen if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/openshift-release that referenced this pull request Apr 8, 2021
This was originally part of avoiding broken zones, see e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204) and b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833).  But the installer has
had broken-zone avoidance since way back in
openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return
available zones, 2019-02-07, openshift/installer#1210).  I dunno how
reliably AWS sets 'state: impaired' and similar; it didn't seem to
protect us from e8921c3.  But we're getting ready to pivot to using
multiple AWS accounts, which creates two issues with hard-coding
availability-zone names in the step:

1. References by name are not stable between accounts.  From the AWS
   docs [1]:

     To ensure that resources are distributed across the Availability
     Zones for a Region, we independently map Availability Zones to
     names for each AWS account. For example, the Availability Zone
     us-east-1a for your AWS account might not be the same location as
     us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might
   apply to one account but not the other.  And the installer does not
   currently accept zone IDs (a zone-ID-to-name lookup sketch follows
   after this commit message).

2. References by name may not exist in other accounts.  From the AWS
   docs [1]:

     As Availability Zones grow over time, our ability to expand them
     can become constrained. If this happens, we might restrict you
     from launching an instance in a constrained Availability Zone
     unless you already have an instance in that Availability
     Zone. Eventually, we might also remove the constrained
     Availability Zone from the list of Availability Zones for new
     accounts. Therefore, your account might have a different number
     of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't name
   sequentially, e.g. our new account lacks us-west-1a:

     $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1a usw1-az3 available
     us-west-1b usw1-az1 available
     $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1b usw1-az3 available
     us-west-1c usw1-az1 available

   I have no idea why they decided to do that, but we have to work
   with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although
we are still vulnerable to (2) with the a/b default line.  We'll do
something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it
to try to set up each available zone, and that causes more API
contention and resource consumption than we want.  Background on that
in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS
availability zones, 2019-03-28, openshift#3285) and d87fffb
(ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615),
as well as the rejected/rotted-out [2].

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487
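
Since item (1) above points out that the installer only accepts zone names, the describe-availability-zones zone-id filter can translate a stable zone ID into whatever name it carries in a given account. A minimal sketch, reusing the ci/ci-2 profiles and the usw1-az1 ID from the output above:

    # Look up what usw1-az1 is called in each account; it is the resulting
    # ZoneName, not the ID, that gets fed to the installer.
    AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones \
      --filters Name=zone-id,Values=usw1-az1 \
      --query 'AvailabilityZones[0].ZoneName' --output text
    AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones \
      --filters Name=zone-id,Values=usw1-az1 \
      --query 'AvailabilityZones[0].ZoneName' --output text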