
Allow setting backup regions and zones when creating GKE clusters and a few other fixes #85

Closed

Conversation


@chizhg chizhg commented Jan 25, 2021

This PR includes a few changes:

  1. Allow setting backup regions and zones when creating GKE clusters, and when specific errors happen, retry creating the clusters in these backup regions/zones (see the sketch after this list). This is important since a lot of infra flakes are caused by these errors, and there is no good way to prevent them other than retrying; see more details in internal issue 162609408

  2. Change the log level for logging the commands from 2 to 1, since these logs are more important than the other level-2 logging

  3. Fix a bug when running kubetest2-gke with only the --down flag but not --up

  4. A few other small coding style improvements and cleanups
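
Below is a minimal, self-contained sketch of the retry flow described in item 1. The names (deployer, createClusters, isRetryableError, backupZones) and the stockout message are illustrative assumptions for this sketch, not necessarily the actual code in this PR:

package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Example stockout pattern; the full retryable-error list is discussed later in this PR.
var stockoutPattern = regexp.MustCompile(`does not have enough resources available to fulfill`)

func isRetryableError(err error) bool {
	return err != nil && stockoutPattern.MatchString(err.Error())
}

type deployer struct {
	zone        string
	backupZones []string
}

// createClusters is a stub standing in for the real gcloud-based creation logic.
func (d *deployer) createClusters() error {
	if d.zone == "us-central1-a" {
		return errors.New("zone us-central1-a does not have enough resources available to fulfill the request")
	}
	return nil
}

// up tries the primary zone first and falls back to each backup zone,
// but only when the failure matches a retryable pattern.
func (d *deployer) up() error {
	zones := append([]string{d.zone}, d.backupZones...)
	var err error
	for _, z := range zones {
		d.zone = z
		if err = d.createClusters(); err == nil || !isRetryableError(err) {
			return err
		}
		fmt.Printf("retryable error in zone %q: %v; trying the next backup\n", z, err)
	}
	return err
}

func main() {
	d := &deployer{zone: "us-central1-a", backupZones: []string{"us-west1-b"}}
	fmt.Println("result:", d.up())
}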

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 25, 2021
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chizhg
To complete the pull request process, please assign bentheelder after the PR has been reviewed.
You can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 25, 2021
@chizhg

chizhg commented Jan 25, 2021

/uncc @cofyc

@k8s-ci-robot k8s-ci-robot removed the request for review from cofyc January 25, 2021 04:06
@amwat left a comment

I feel like this is the wrong layer to solve stockout issues with fallbacks.
A single kubetest2 invocation should ideally be reproducible.

You can probably fix this at the job level, right? By invoking multiple kubetest2 runs in different zones?

kubetest2-gke/deployer/commandutils.go (outdated; resolved)
pkg/exec/exec.go (outdated; resolved)
// Cancel the context to kill other cluster creation processes if any error happens.
cancel()
- return fmt.Errorf("error creating cluster: %v", err)
+ return fmt.Errorf("error creating cluster %q: %w", cluster.name, errors.New(buf.String()))
Contributor

Won't this buffer also have stdout?

Contributor Author

Yeah, the output will be written to both stdout and the buffer.

The reasons I'm writing it like this are:

  1. We need to capture the output to match the error pattern and decide whether the creation needs to be retried (see the sketch below)
  2. We still need to keep stdout and stderr for informational purposes

Does it make sense to you?
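
For illustration, here is a small, self-contained sketch of that tee pattern: command output goes to the console and to a buffer that can later be matched against the retryable-error patterns. The direct use of os/exec and the gcloud command are assumptions for this sketch; the PR itself goes through the pkg/exec wrapper:

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	var buf bytes.Buffer
	// Any command works here; "gcloud container clusters list" is only an example.
	cmd := exec.Command("gcloud", "container", "clusters", "list")
	// Everything the command prints still reaches the console, and a copy is
	// kept in buf so the error text can be inspected afterwards.
	cmd.Stdout = io.MultiWriter(os.Stdout, &buf)
	cmd.Stderr = io.MultiWriter(os.Stderr, &buf)
	if err := cmd.Run(); err != nil {
		fmt.Printf("error creating cluster: %v\ncaptured output:\n%s\n", err, buf.String())
	}
}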

}
d.zone = z

if err = d.createClusters(); err == nil || !isRetryableError(err) {
Contributor

This breaks "down": we would need to delete resources from failed attempts as well.

@chizhg Jan 26, 2021

I don't have a good solution for this, but IMO we will only need --backup-regions and --backup-zones in CI, since flakiness when creating clusters is not really a concern when users run the test flows locally. And in CI we always have the Boskos janitor to clean things up, so it does not seem like a big issue.

Also, the down flow is already broken: it will throw an error when deleting clusters if up failed to create them due to quota/stockout issues, in which case the networks and firewall rules are leaked.

Contributor Author

Or we can add an extra check to disallow setting --backup-regions and --backup-zones when the CI environment variable is not true - https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md#job-environment-variables?

Contributor Author

Also, if the two backup flags are not set, which is the default, the changes in this PR are just a no-op, so it's not a breaking change for existing test flows.

Contributor

Relying on the janitor is not an ideal solution. The janitor is only supposed to be used when, say, jobs are aborted. A successful kubetest2 --up --down should not rely on the Boskos janitor existing.

Why is the down flow broken? If you mean that the cluster delete command fails when the cluster wasn't created in the first place, then that's WAI (working as intended)?
We definitely don't take care of resources that are brought up behind the scenes during cluster creation, but if we are creating any resources/firewall rules as part of Up, then those should be cleaned up in Down; otherwise that's a bug.

Currently, as this stands, if we have 2 clusters and 2 zones (1 original and 1 backup), and cluster1 is created in zone1 but cluster2 fails, we retry everything in zone2; if both clusters then come up in zone2, cluster1 in zone1 is leaked.

Contributor Author

Currently, for all the jobs we run with GKE clusters, we are not using the --down flag in order to save time in CI, so we rely completely on the janitor for cleanup, and we haven't seen any issues so far.

> Why is the down flow broken? if you mean that cluster delete command fails if cluster wasn't created in the first place then that's WAI?

Yes, the networks and firewall rules will be leaked if the cluster wasn't created in the first place, but we have the janitor to clean them up, so it's not an issue in CI.

It will complicate the logic a lot if we want to support cleaning up these leftover resources after retrying, and, as I mentioned in another thread, this will be a no-op change if --backup-regions and --backup-zones are empty, which is the default, so it won't impact users who do not need it.

Maybe I can add a description to these two flags saying that setting them will probably leak resources and they should be used with caution, and we can add a TODO to fix it? Or, if you are completely unconvinced by this change, I'll update the ASM/Istio test-infra instead, and then we'll have this logic duplicated everywhere...

Contributor

I'm definitely +1 on having a retry mode, but would like to see the suggested changes :) Any new feature should be self-contained and honor the full kubetest2 cluster lifecycle. We shouldn't be adding features which explicitly rely on also needing the janitor.

Is there any reason why all clusters need to be recreated in a new zone? Can't we just recreate only the ones that fail?
Also, having backup zones tells me that your tests don't really have a preference of zone.
Then we can also make --zone itself take a comma-separated list and choose randomly among them, and then use retries in different zones (sketched below).

I believe that for non-stockout retryable errors, users would also be interested in retrying cluster creation even for the original single-zone requests.

tl;dr: I would love to see this added as a more general retry feature (with all these considerations) that isn't solely for getting around stockout errors (stockout errors can still be one of the retryable errors).
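
To make the alternative concrete, here is a tiny sketch of the comma-separated --zone idea: split the flag value, shuffle it, treat the first entry as the primary zone and the rest as retry candidates. This only illustrates the suggestion; it is not code from the PR, and the function name is hypothetical:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// zonesFromFlag turns a comma-separated --zone value into a randomly ordered
// list: the first entry is the zone to try first, the rest are retry candidates.
func zonesFromFlag(flagValue string) []string {
	zones := strings.Split(flagValue, ",")
	rand.Shuffle(len(zones), func(i, j int) { zones[i], zones[j] = zones[j], zones[i] })
	return zones
}

func main() {
	zones := zonesFromFlag("us-central1-a,us-west1-b,us-east1-c")
	fmt.Println("primary:", zones[0], "backups:", zones[1:])
}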

@chizhg Jan 27, 2021

> Is there any reason why all clusters need to be recreated in a new zone, can't we just recreate the ones that fail?

I can try that, but it looks like it will make the Up function even more complicated, and it also complicates the logic if we want to do proper cleanups, since we'd need to track which regions/zones were tried for each cluster. Also, for multi-cluster setups we want the clusters to always be created in the same region/zone.

> also having backup zones tells me that your tests don't really have any preference of a zone.
> Then, we can also make --zone itself take a comma separated list and choose randomly among them (and then use retries in different zones)

Currently for multi-cluster we assume the clusters should be created in the same region/zone, but in the future we might want them to be created in different regions/zones (the requirement was brought up by one of our engineering partners, but it's not a high priority); then we'd need to change --region and --zone to take a comma-separated list whose length equals the number of clusters. That's why I'm not reusing this flag to set the backup regions/zones.

> I believe for non stockout retriable errors, users would also be interested in retrying cluster creation even for original single zone requests.

Other errors are mostly of the form "the cluster was created, but the health check failed", so we will always get an error if we retry creating the cluster with the same name in the same region/zone. To avoid that we'd need to delete the clusters before recreating them, which is less efficient than directly recreating them in a different region/zone.

I understand your concerns and agree they are all valid, but I'm not sure what the best way forward is without complicating the code logic here too much.

Contributor

Yes, I would like to see the retry feature discussed in more detail (especially for multi-cluster) before we try to add anything, maybe as an issue/doc that covers the existing and future use cases (like you mentioned).

We can think about reworking the flags to be read from a config file if we want a lot of customizability and want each cluster's lifecycle handled independently.

Contributor Author

I've created #88, please take a look when you have time. Thanks.

@chizhg

chizhg commented Jan 25, 2021

> I feel like this is the wrong layer to solve stockout issues with fallbacks.
> a single kubetest2 invocation should ideally be reproducible.
>
> You can probably fix this at the job level right? by invoking multiple kubetest2 runs in different zones?

This was also my initial thought, and we already did it in Knative - https://github.com/knative/test-infra/blob/master/pkg/clustermanager/kubetest2/gke.go#L47.

But we are also seeing these issues in Istio, where they are the most common infra failures we've seen in the past few months, and I feel they are common to all projects that need frequent cluster creation, so I think it makes more sense to make the change upstream instead of duplicating the logic everywhere.

WDYT?

@chizhg

chizhg commented Jan 25, 2021

> I feel like this is the wrong layer to solve stockout issues with fallbacks.
> a single kubetest2 invocation should ideally be reproducible.
>
> You can probably fix this at the job level right? by invoking multiple kubetest2 runs in different zones?

I'm not sure I correctly understood what you meant by invoking multiple kubetest2 runs in different zones, but we cannot predict the stockout errors, since they only happen when there are not enough machines/resources in that data center to provision the requested nodes, and the stock there is always changing. That's why I believe catch-failure-and-retry is the only pattern that mitigates this issue.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 25, 2021
@amwat left a comment

Yeah, I meant the same retry logic that you have done for Knative.
How do we plan on handling multiple clusters? If one fails, do we recreate all of them in new zones, or just the one that failed?
We also need to take care of cleanup for all creation attempts.

var retryableCreationErrors = []*regexp.Regexp{
regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
regexp.MustCompile(`.*only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly.*`),
regexp.MustCompile(`.*All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy.*`),
Contributor

"Component unhealthy" doesn't seem like a retryable error; we see similar failures many times for legitimate cluster bring-up issues.

@chizhg Jan 26, 2021

Do you mean it won't succeed even if we retry in different regions/zones? Based on our experience in Knative, retrying usually solves the problem, and that's how we have kept the infra flakiness for Knative at a very low rate.

Contributor

If it works after retrying in a different zone, that just seems like a coincidence :)
But that is very much a legitimate issue in most of our GKE components testing, which shouldn't be retried.

// - nodes fail to start
// - component is unhealthy
var retryableCreationErrors = []*regexp.Regexp{
regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
Contributor

We should differentiate stockout errors from regular quota errors.

Same concerns as #13 (comment).

Contributor Author

Ditto. The same reply as #85 (comment)

@chizhg

chizhg commented Jan 26, 2021

To add more clarification on the errors we want to retry on here (a sketch of the matching logic follows at the end of this comment):

  1. .*does not have enough resources available to fulfill.* corresponds to the GCE_STOCKOUT error. As mentioned in #85 (comment), there is no way to predict or prevent this error, and the only way to mitigate it is retrying the cluster creation in a different region/zone.

  2. .*All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy.* is probably a legitimate cluster bring-up issue as you mentioned in #85 (comment), but it is the reason for a large portion of the infra flakiness, and based on our experience with Knative, retrying the cluster creation in a different region/zone will usually succeed.

  3. .*only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly.* is probably similar to the one above, but it also caused lots of infra flakiness without the retry logic, so I'm adding it for the same reason.

The quota error you mentioned in #13 (comment) actually has a different error message - Quota exceeded for quota metric 'Requests' and limit 'Requests per minute' of service 'container.googleapis.com' - and I definitely agree we should NOT retry on this error. The other version error in #13 (comment) has already been solved by #66, so we don't need to retry on it either.
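
A self-contained sketch of matching captured output against these patterns (the pattern strings are taken from the diff excerpts above; the function name and exact behavior in the final code may differ). Note that the quota error deliberately does not match:

package main

import (
	"fmt"
	"regexp"
)

// Retryable cluster creation errors:
// - stockout (GCE_STOCKOUT)
// - nodes failing to start
// - component unhealthy
var retryableCreationErrors = []*regexp.Regexp{
	regexp.MustCompile(`does not have enough resources available to fulfill`),
	regexp.MustCompile(`only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly`),
	regexp.MustCompile(`All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy`),
}

// isRetryableError reports whether the captured creation output matches any of
// the retryable patterns above.
func isRetryableError(output string) bool {
	for _, re := range retryableCreationErrors {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRetryableError("ZONE_RESOURCE_POOL_EXHAUSTED: zone does not have enough resources available to fulfill the request"))               // true
	fmt.Println(isRetryableError("Quota exceeded for quota metric 'Requests' and limit 'Requests per minute' of service 'container.googleapis.com'")) // false
}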

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 26, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2021
@k8s-ci-robot

@chizhg: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
