Allow setting backup regions and zones when creating GKE clusters and a few other fixes #85
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: chizhg. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/uncc @cofyc
I feel like this is the wrong layer to solve stockout issues with fallbacks.
A single kubetest2 invocation should ideally be reproducible.
You can probably fix this at the job level instead, right? By invoking multiple kubetest2 runs in different zones?
kubetest2-gke/deployer/up.go
// Cancel the context to kill other cluster creation processes if any error happens.
cancel()
return fmt.Errorf("error creating cluster: %v", err)
return fmt.Errorf("error creating cluster %q: %w", cluster.name, errors.New(buf.String()))
Won't this buffer also have stdout?
Yeah, the output will be written to both stdout and the buffer.
The reasons I'm writing it like this are:
- We need to capture the output to match it against the error patterns and decide whether the creation needs to be retried.
- We still need to keep the stdout and stderr for informational purposes.
Does it make sense to you?
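For context, a minimal sketch of the pattern being described here: teeing the command output into a buffer for later error matching while still streaming it to the console. The helper name runWithOutputCapture is illustrative, not the actual kubetest2 code.

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
)

// runWithOutputCapture tees the command's stdout and stderr into a buffer
// (so the combined output can be matched against error patterns later)
// while still streaming them to the console.
func runWithOutputCapture(cmd *exec.Cmd) (string, error) {
	var buf bytes.Buffer
	cmd.Stdout = io.MultiWriter(os.Stdout, &buf)
	cmd.Stderr = io.MultiWriter(os.Stderr, &buf)
	err := cmd.Run()
	return buf.String(), err
}

func main() {
	out, err := runWithOutputCapture(exec.Command("gcloud", "container", "clusters", "list"))
	if err != nil {
		fmt.Printf("command failed: %v\ncaptured output:\n%s", err, out)
	}
}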
}
d.zone = z

if err = d.createClusters(); err == nil || !isRetryableError(err) {
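A hedged sketch of how the full retry loop around this excerpt might look. The deployer fields and helpers below are illustrative stand-ins based on the excerpt, not the exact kubetest2-gke code.

package main

// Illustrative stand-ins for the real kubetest2-gke deployer type and helpers.
type deployer struct {
	zone        string
	backupZones []string
}

func (d *deployer) createClusters() error { return nil } // placeholder

func isRetryableError(err error) bool { return false } // placeholder

// upWithRetries tries the primary zone first, then each backup zone in turn,
// stopping on success or on a non-retryable error. Note that resources from a
// failed attempt in a previous zone are not cleaned up here, which is the
// leak discussed in this thread.
func (d *deployer) upWithRetries() error {
	zones := append([]string{d.zone}, d.backupZones...)
	var err error
	for _, z := range zones {
		d.zone = z
		if err = d.createClusters(); err == nil || !isRetryableError(err) {
			return err
		}
	}
	return err
}

func main() {
	d := &deployer{zone: "us-central1-a", backupZones: []string{"us-west1-b"}}
	_ = d.upWithRetries()
}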
This messes up "down": we would need to delete resources from failed attempts as well.
I don't have a good solution to this comment, but IMO we will only need --backup-regions and --backup-zones in CI, since cluster-creation flakiness is not really a concern when users run the test flows locally. And in CI we always have the Boskos janitor to clean things up, so it does not seem like a big issue.
Also, the down flow is already broken: it will throw an error when deleting clusters if up failed to create them due to quota/stockout issues, in which case the networks and firewall rules are leaked.
Or we can add an extra check to disallow setting --backup-regions and --backup-zones when CI is not true - https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md#job-environment-variables?
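A minimal sketch of that guard, assuming the standard Prow CI environment variable; the function name and flag wiring are hypothetical, not the actual kubetest2-gke code.

package main

import (
	"fmt"
	"os"
)

// verifyBackupFlags rejects the backup-location flags when not running in CI,
// since failed attempts can leak resources that only the CI janitor cleans up.
// Prow sets CI=true for all jobs (see the job-environment-variables doc above).
func verifyBackupFlags(backupRegions, backupZones []string) error {
	if (len(backupRegions) > 0 || len(backupZones) > 0) && os.Getenv("CI") != "true" {
		return fmt.Errorf("--backup-regions and --backup-zones can only be used in CI")
	}
	return nil
}

func main() {
	if err := verifyBackupFlags([]string{"us-west1"}, nil); err != nil {
		fmt.Println(err)
	}
}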
Also, if the two backup flags are not set, which is the default, the changes in this PR are just a no-op, so it's not a breaking change for existing test flows.
Relying on the janitor is not an ideal solution; the janitor is only supposed to be used when, say, jobs are aborted. A successful kubetest2 --up --down run should not rely on the Boskos janitor existing.
Why is the down flow broken? If you mean that the cluster delete command fails when the cluster wasn't created in the first place, then that's working as intended?
We definitely don't take care of resources that are brought up behind the scenes during cluster creation, but if we are creating any resources/firewall rules as part of Up, then those should be cleaned up in Down; otherwise that's a bug.
Currently, as this stands, if we have 2 clusters and 2 zones (1 original and 1 backup), and cluster1 is created in zone1 but cluster2 fails, we retry everything in zone2 - and if both clusters then come up in zone2, cluster1 in zone1 is leaked.
Currently, for all the jobs we run with GKE clusters, we are not using the --down flag in order to save time in CI, so we are completely relying on the janitor to do the cleanup, and we haven't seen any issues so far.
"Why is the down flow broken? If you mean that the cluster delete command fails when the cluster wasn't created in the first place, then that's working as intended?"
Yes, the networks and firewall rules will be leaked if the cluster wasn't created in the first place, but we have the janitor to clean them up, so it's not an issue in CI.
It will complicate the logic a lot if we want to support cleaning up these leftover resources from retrying, and, as I mentioned in another thread, this is a no-op change if --backup-regions and --backup-zones are empty (which is the default), so it won't impact users who do not need it.
Maybe I can add a description to these two flags saying "setting them will probably leak resources, please use them with caution", and we can add a TODO to fix it? Or, if you are completely unconvinced by this change, I'll update the ASM/Istio test-infra instead, and then we'll have this logic duplicated everywhere...
I'm definitely +1 on having a retry mode, but I would like to see the suggested changes :) Any new feature should be self-contained and honor the full kubetest2 cluster lifecycle. We shouldn't be adding features that explicitly rely on also needing the janitor.
Is there any reason why all clusters need to be recreated in a new zone? Can't we just recreate the ones that fail?
Also, having backup zones tells me that your tests don't really have any preference for a zone. Then we could also make --zone itself take a comma-separated list and choose randomly among the entries (and then use retries in different zones), as in the sketch below.
I believe that for non-stockout retriable errors, users would also be interested in retrying cluster creation even for the original single-zone requests.
tl;dr: I would love to see this added as a more general retry feature (with all these considerations) that isn't solely for getting around stockout errors. (Stockout errors can still be one of the retriable errors.)
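To illustrate that suggestion (this is not something in the PR), a sketch where --zone accepts a comma-separated list, one entry is chosen at random, and the rest become retry candidates; pickZone is a hypothetical helper name.

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// pickZone splits a comma-separated --zone value, picks one entry at random,
// and returns it along with the remaining zones as retry candidates.
func pickZone(zoneFlag string) (string, []string) {
	zones := strings.Split(zoneFlag, ",")
	i := rand.Intn(len(zones))
	rest := append(append([]string{}, zones[:i]...), zones[i+1:]...)
	return zones[i], rest
}

func main() {
	zone, backups := pickZone("us-central1-a,us-west1-b,us-east1-c")
	fmt.Println("picked:", zone, "backups:", backups)
}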
"Is there any reason why all clusters need to be recreated in a new zone? Can't we just recreate the ones that fail?"
I can try that, but it looks like it will make the Up function even more complicated, and it also complicates the cleanup logic, since we would need to track which regions/zones were attempted for each cluster. Also, for multicluster we want all the clusters to always be created in the same region/zone.
"Also, having backup zones tells me that your tests don't really have any preference for a zone. Then we could also make --zone itself take a comma-separated list and choose randomly among the entries (and then use retries in different zones)."
Currently, for multicluster, we assume the clusters should be created in the same region/zone, but in the future we might want them to be created in different regions/zones (the requirement was brought up by one of our engineering partners, but it's not a high priority). Then we would need to change --region and --zone to take a comma-separated list with its length equal to the number of clusters. That's why I'm not reusing these flags to set the backup regions/zones.
"I believe that for non-stockout retriable errors, users would also be interested in retrying cluster creation even for the original single-zone requests."
Other errors are mostly of the form "the cluster was created, but the health check failed", so we would always get an error if we retried creating a cluster with the same name in the same region/zone. To avoid that, we would need to delete the clusters before recreating them, which is less efficient than directly recreating them in a different region/zone.
I get your concerns and I agree they are all valid, but I'm not sure what the best way forward is without complicating the code logic here too much.
Yes, I would like to see the retry feature discussed in more detail (especially for multicluster) before we try to add anything, maybe as an issue/doc covering existing and future use cases (like the ones you mentioned).
We can think about reworking the flags to be read from a config file if we want a lot of customizability and want each cluster's lifecycle to be handled independently.
I've created #88, please take a look when you have time. Thanks.
This was also my initial thought on this, and we already did it in Knative - https://github.com/knative/test-infra/blob/master/pkg/clustermanager/kubetest2/gke.go#L47. But we are also seeing these issues in Istio, where they are the most common infra failures we've seen in the past few months, and I feel they are common to all projects that need frequent cluster creation, so I think it makes more sense to make the update upstream instead of duplicating the logic everywhere. WDYT?
I'm not sure if I correctly understood what you meant by
Force-pushed from c427495 to 789150e.
Yeah, I meant the same retry logic that you have done for Knative.
How do we plan on handling multiple clusters? If one fails, do we recreate all of them in new zones, or just the one that failed?
We also need to take care of cleanup for all creation attempts.
var retryableCreationErrors = []*regexp.Regexp{
	regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
	regexp.MustCompile(`.*only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly.*`),
	regexp.MustCompile(`.*All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy.*`),
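For context, a sketch of how these patterns might be applied to decide whether to retry. The helper name matches the isRetryableError call in the earlier excerpt, but the exact implementation in the PR may differ.

package main

import (
	"errors"
	"fmt"
	"regexp"
)

// The retryable patterns from the diff above.
var retryableCreationErrors = []*regexp.Regexp{
	regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
	regexp.MustCompile(`.*only \d+ nodes out of \d+ have registered; this is likely due to Nodes failing to start correctly.*`),
	regexp.MustCompile(`.*All cluster resources were brought up, but: component .+ from endpoint .+ is unhealthy.*`),
}

// isRetryableError reports whether the error message matches any of the
// known retryable cluster-creation error patterns.
func isRetryableError(err error) bool {
	if err == nil {
		return false
	}
	for _, re := range retryableCreationErrors {
		if re.MatchString(err.Error()) {
			return true
		}
	}
	return false
}

func main() {
	err := errors.New("zone us-central1-a does not have enough resources available to fulfill the request")
	fmt.Println(isRetryableError(err)) // true
}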
"Component unhealthy" doesn't seem like a retryable error; we see similar messages many times for legitimate cluster bring-up issues.
Do you mean it won't succeed even if we retry in different regions/zones? Based on our experience in Knative, retrying usually solves the problem, and that's how we kept the infra flakiness for Knative at a very low rate.
If it works after retrying in a different zone, that just seems like a coincidence :)
But that is very much a legitimate issue in most of our GKE component testing, and it shouldn't be retried.
// - nodes fail to start
// - component is unhealthy
var retryableCreationErrors = []*regexp.Regexp{
	regexp.MustCompile(`.*does not have enough resources available to fulfill.*`),
We should differentiate stockout errors from regular quota errors.
Same concerns as #13 (comment).
Ditto - the same reply as #85 (comment).
To add more clarification about the errors we want to retry here: the quota error you mentioned in #13 (comment) actually has a different error message -
Force-pushed from 789150e to 21f47f9.
Force-pushed from 21f47f9 to ee3e813.
Force-pushed from ee3e813 to 75eb6a9.
@chizhg: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR includes a few changes:
- Allow setting backup regions and zones when creating GKE clusters, and when specific errors happen, retry creating the clusters in these backup regions/zones. This is important since a lot of infra flakes are caused by these errors, and there is no good way to prevent them other than retrying; see more details in internal issue 162609408.
- Change the log level for logging the commands from 2 to 1, since they are more important than the level-2 logging.
- Fix a bug when running kubetest2-gke with only the --down flag but not --up.
- A few other small coding-style improvements and cleanups.