
Allow setting backup regions and zones when creating GKE clusters and a few other fixes #85

Closed

Conversation


@chizhg chizhg commented Jan 25, 2021

This PR includes a few changes:

  1. Allow setting backup regions and zones when creating GKE clusters, and when specific errors happen, retry creating the clusters in these backup regions/zones (see the sketch after this list). This is important since a lot of infra flakes are caused by these errors, and there is no good way to prevent them other than retrying; see more details in internal issue 162609408

  2. Change the log level for logging the commands from 2 to 1, since these logs are more important than the other level-2 logging

  3. Fix a bug when running kubetest2-gke with only the --down flag but not --up

  4. A few other small coding style improvements and cleanups
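
Below is a minimal, self-contained sketch of the retry flow described in item 1. The names (deployer, createClusters, isRetryableError, backupZones) and the stockout message are illustrative assumptions for this sketch, not necessarily the actual code in this PR:

package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Example stockout pattern; the full retryable-error list is discussed later in this PR.
var stockoutPattern = regexp.MustCompile(`does not have enough resources available to fulfill`)

func isRetryableError(err error) bool {
	return err != nil && stockoutPattern.MatchString(err.Error())
}

type deployer struct {
	zone        string
	backupZones []string
}

// createClusters is a stub standing in for the real gcloud-based creation logic.
func (d *deployer) createClusters() error {
	if d.zone == "us-central1-a" {
		return errors.New("zone us-central1-a does not have enough resources available to fulfill the request")
	}
	return nil
}

// up tries the primary zone first and falls back to each backup zone,
// but only when the failure matches a retryable pattern.
func (d *deployer) up() error {
	zones := append([]string{d.zone}, d.backupZones...)
	var err error
	for _, z := range zones {
		d.zone = z
		if err = d.createClusters(); err == nil || !isRetryableError(err) {
			return err
		}
		fmt.Printf("retryable error in zone %q: %v; trying the next backup\n", z, err)
	}
	return err
}

func main() {
	d := &deployer{zone: "us-central1-a", backupZones: []string{"us-west1-b"}}
	fmt.Println("result:", d.up())
}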

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 25, 2021
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chizhg
To complete the pull request process, please assign bentheelder after the PR has been reviewed.
You can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 25, 2021
@chizhg

chizhg commented Jan 25, 2021

/uncc @cofyc

@k8s-ci-robot k8s-ci-robot removed the request for review from cofyc January 25, 2021 04:06
@amwat left a comment

I feel like this is the wrong layer to solve stockout issues with fallbacks.
A single kubetest2 invocation should ideally be reproducible.

You can probably fix this at the job level, right? By invoking multiple kubetest2 runs in different zones?

kubetest2-gke/deployer/commandutils.go (outdated; resolved)
pkg/exec/exec.go (outdated; resolved)
// Cancel the context to kill other cluster creation processes if any error happens.
cancel()
- return fmt.Errorf("error creating cluster: %v", err)
+ return fmt.Errorf("error creating cluster %q: %w", cluster.name, errors.New(buf.String()))
Contributor

Won't this buffer also have stdout?

Contributor Author

Yeah, the output will be written to both stdout and the buffer.

The reasons I'm writing it like this are:

  1. We need to capture the output to match the error pattern and decide whether the creation needs to be retried (see the sketch below)
  2. We still need to keep stdout and stderr for informational purposes

Does it make sense to you?
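
For illustration, here is a small, self-contained sketch of that tee pattern: command output goes to the console and to a buffer that can later be matched against the retryable-error patterns. The direct use of os/exec and the gcloud command are assumptions for this sketch; the PR itself goes through the pkg/exec wrapper:

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	var buf bytes.Buffer
	// Any command works here; "gcloud container clusters list" is only an example.
	cmd := exec.Command("gcloud", "container", "clusters", "list")
	// Everything the command prints still reaches the console, and a copy is
	// kept in buf so the error text can be inspected afterwards.
	cmd.Stdout = io.MultiWriter(os.Stdout, &buf)
	cmd.Stderr = io.MultiWriter(os.Stderr, &buf)
	if err := cmd.Run(); err != nil {
		fmt.Printf("error creating cluster: %v\ncaptured output:\n%s\n", err, buf.String())
	}
}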

}
d.zone = z

if err = d.createClusters(); err == nil || !isRetryableError(err) {
Contributor

This breaks "down": we would need to delete resources from failed attempts as well.

@chizhg Jan 26, 2021

I don't have a good solution for this, but IMO we will only need --backup-regions and --backup-zones in CI, since flakiness when creating clusters is not really a concern when users run the test flows locally. And in CI we always have the Boskos janitor to clean things up, so it does not seem like a big issue.

Also, the down flow is already broken: it will throw an error when deleting clusters if up failed to create them due to quota/stockout issues, in which case the networks and firewall rules are leaked.

Contributor Author

Or we can add an extra check to disallow setting --backup-regions and --backup-zones when the CI environment variable is not true - https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md#job-environment-variables?

Contributor Author

Also, if the two backup flags are not set, which is the default, the changes in this PR are just a no-op, so it's not a breaking change for existing test flows.

Contributor

Relying on the janitor is not an ideal solution. The janitor is only supposed to be used when, say, jobs are aborted. A successful kubetest2 --up --down should not rely on the Boskos janitor existing.

Why is the down flow broken? If you mean that the cluster delete command fails when the cluster wasn't created in the first place, then that's WAI (working as intended)?
We definitely don't take care of resources that are brought up behind the scenes during cluster creation, but if we are creating any resources/firewall rules as part of Up, then those should be cleaned up in Down; otherwise that's a bug.

Currently, as this stands, if we have 2 clusters and 2 zones (1 original and 1 backup), and cluster1 is created in zone1 but cluster2 fails, we retry everything in zone2; if both clusters then come up in zone2, cluster1 in zone1 is leaked.

Contributor Author

Currently, for all the jobs we run with GKE clusters, we are not using the --down flag in order to save time in CI, so we rely completely on the janitor for cleanup, and we haven't seen any issues so far.

> Why is the down flow broken? if you mean that cluster delete command fails if cluster wasn't created in the first place then that's WAI?

Yes, the networks and firewall rules will be leaked if the cluster wasn't created in the first place, but we have the janitor to clean them up, so it's not an issue in CI.

It will complicate the logic a lot if we want to support cleaning up these leftover resources after retrying, and, as I mentioned in another thread, this will be a no-op change if --backup-regions and --backup-zones are empty, which is the default, so it won't impact users who do not need it.

Maybe I can add a description to these two flags saying that setting them will probably leak resources and they should be used with caution, and we can add a TODO to fix it? Or, if you are completely unconvinced by this change, I'll update the ASM/Istio test-infra instead, and then we'll have this logic duplicated everywhere...

Contributor

I'm definitely +1 on having a retry mode, but would like to see the suggested changes :) Any new feature should be self-contained and honor the full kubetest2 cluster lifecycle. We shouldn't be adding features which explicitly rely on also needing the janitor.

Is there any reason why all clusters need to be recreated in a new zone? Can't we just recreate only the ones that fail?
Also, having backup zones tells me that your tests don't really have a preference of zone.
Then we can also make --zone itself take a comma-separated list and choose randomly among them, and then use retries in different zones (sketched below).

I believe that for non-stockout retryable errors, users would also be interested in retrying cluster creation even for the original single-zone requests.

tl;dr: I would love to see this added as a more general retry feature (with all these considerations) that isn't solely for getting around stockout errors (stockout errors can still be one of the retryable errors).
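
To make the alternative concrete, here is a tiny sketch of the comma-separated --zone idea: split the flag value, shuffle it, treat the first entry as the primary zone and the rest as retry candidates. This only illustrates the suggestion; it is not code from the PR, and the function name is hypothetical:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// zonesFromFlag turns a comma-separated --zone value into a randomly ordered
// list: the first entry is the zone to try first, the rest are retry candidates.
func zonesFromFlag(flagValue string) []string {
	zones := strings.Split(flagValue, ",")
	rand.Shuffle(len(zones), func(i, j int) { zones[i], zones[j] = zones[j], zones[i] })
	return zones
}

func main() {
	zones := zonesFromFlag("us-central1-a,us-west1-b,us-east1-c")
	fmt.Println("primary:", zones[0], "backups:", zones[1:])
}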

@chizhg Jan 27, 2021

> Is there any reason why all clusters need to be recreated in a new zone, can't we just recreate the ones that fail?

I can try that, but it looks like it will make the Up function even more complicated, and it also complicates the logic if we want to do proper cleanups, since we'd need to track which regions/zones were tried for each cluster. Also, for multi-cluster setups we want the clusters to always be created in the same region/zone.

> also having backup zones tells me that your tests don't really have any preference of a zone.
> Then, we can also make --zone itself take a comma separated list and choose randomly among them (and then use retries in different zones)

Currently for multi-cluster we assume the clusters should be created in the same region/zone, but in the future we might want them to be created in different regions/zones (the requirement was brought up by one of our engineering partners, but it's not a high priority); then we'd need to change --region and --zone to take a comma-separated list whose length equals the number of clusters. That's why I'm not reusing this flag to set the backup regions/zones.

> I believe for non stockout retriable errors, users would also be interested in retrying cluster creation even for original single zone requests.

Other errors are mostly of the form "the cluster was created, but the health check failed", so we will always get an error if we retry creating the cluster with the same name in the same region/zone. To avoid that we'd need to delete the clusters before recreating them, which is less efficient than directly recreating them in a different region/zone.

I understand your concerns and agree they are all valid, but I'm not sure what the best way forward is without complicating the code logic here too much.

Contributor

Yes, I would like to see the retry feature discussed in more detail (especially for multi-cluster) before we try to add anything, maybe as an issue/doc that covers the existing and future use cases (like you mentioned).

We can think about reworking the flags to be read from a config file if we want a lot of customizability and want each cluster's lifecycle handled independently.

Contributor Author

I've created #88, please take a look when you have time. Thanks.

@chizhg

chizhg commented Jan 25, 2021

> I feel like this is the wrong layer to solve stockout issues with fallbacks.
> a single kubetest2 invocation should ideally be reproducible.
>
> You can probably fix this at the job level right? by invoking multiple kubetest2 runs in different zones?

This was also my initial thought, and we already did it in Knative - https://github.com/knative/test-infra/blob/master/pkg/clustermanager/kubetest2/gke.go#L47.

But we are also seeing these issues in Istio, where they are the most common infra failures we've seen in the past few months, and I feel they are common to all projects that need frequent cluster creation, so I think it makes more sense to make the change upstream instead of duplicating the logic everywhere.

WDYT?

@chizhg

chizhg commented Jan 25, 2021

> I feel like this is the wrong layer to solve stockout issues with fallbacks.
> a single kubetest2 invocation should ideally be reproducible.
>
> You can probably fix this at the job level right? by invoking multiple kubetest2 runs in different zones?

I'm not sure I correctly understood what you meant by invoking multiple kubetest2 runs in different zones, but we cannot predict the stockout errors, since they only happen when there are not enough machines/resources in that data center to provision the requested nodes, and the stock there is always changing. That's why I believe catch-failure-and-retry is the only pattern that mitigates this issue.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 25, 2021
@amwat left a comment

Yeah, I meant the same retry logic that you have done for Knative.
How do we plan on handling multiple clusters? If one fails, do we recreate all of them in new zones, or just the one that failed?
We also need to take care of cleanup for all creation attempts.

var retryableCreationErrors = []*regexp.Regexp{
regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
regexp.MustCompile(`.*only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly.*`),
regexp.MustCompile(`.*All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy.*`),
Contributor

"Component unhealthy" doesn't seem like a retryable error; we see similar failures many times for legitimate cluster bring-up issues.

@chizhg Jan 26, 2021

Do you mean it won't succeed even if we retry in different regions/zones? Based on our experience in Knative, retrying usually solves the problem, and that's how we have kept the infra flakiness for Knative at a very low rate.

Contributor

If it works after retrying in a different zone, that just seems like a coincidence :)
But that is very much a legitimate issue in most of our GKE components testing, which shouldn't be retried.

// - nodes fail to start
// - component is unhealthy
var retryableCreationErrors = []*regexp.Regexp{
regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
Contributor

We should differentiate stockout errors from regular quota errors.

Same concerns as #13 (comment).

Contributor Author

Ditto. The same reply as #85 (comment)

@chizhg

chizhg commented Jan 26, 2021

To add more clarification on the errors we want to retry on here (a sketch of the matching logic follows at the end of this comment):

  1. .*does not have enough resources available to fulfill.* corresponds to the GCE_STOCKOUT error. As mentioned in #85 (comment), there is no way to predict or prevent this error, and the only way to mitigate it is retrying the cluster creation in a different region/zone.

  2. .*All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy.* is probably a legitimate cluster bring-up issue as you mentioned in #85 (comment), but it is the reason for a large portion of the infra flakiness, and based on our experience with Knative, retrying the cluster creation in a different region/zone will usually succeed.

  3. .*only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly.* is probably similar to the one above, but it also caused lots of infra flakiness without the retry logic, so I'm adding it for the same reason.

The quota error you mentioned in #13 (comment) actually has a different error message - Quota exceeded for quota metric 'Requests' and limit 'Requests per minute' of service 'container.googleapis.com' - and I definitely agree we should NOT retry on this error. The other version error in #13 (comment) has already been solved by #66, so we don't need to retry on it either.
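
A self-contained sketch of matching captured output against these patterns (the pattern strings are taken from the diff excerpts above; the function name and exact behavior in the final code may differ). Note that the quota error deliberately does not match:

package main

import (
	"fmt"
	"regexp"
)

// Retryable cluster creation errors:
// - stockout (GCE_STOCKOUT)
// - nodes failing to start
// - component unhealthy
var retryableCreationErrors = []*regexp.Regexp{
	regexp.MustCompile(`does not have enough resources available to fulfill`),
	regexp.MustCompile(`only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly`),
	regexp.MustCompile(`All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy`),
}

// isRetryableError reports whether the captured creation output matches any of
// the retryable patterns above.
func isRetryableError(output string) bool {
	for _, re := range retryableCreationErrors {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRetryableError("ZONE_RESOURCE_POOL_EXHAUSTED: zone does not have enough resources available to fulfill the request"))               // true
	fmt.Println(isRetryableError("Quota exceeded for quota metric 'Requests' and limit 'Requests per minute' of service 'container.googleapis.com'")) // false
}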

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 26, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2021
@k8s-ci-robot

@chizhg: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
