Use region specific API calls with VMSS #1850

Merged
merged 1 commit into from
Nov 12, 2021

@devigned devigned commented Nov 10, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:
When listing VMSSes in a resource group, Azure responds with the values it has in the regional cache. Unfortunately, there are sometimes spikes in regional replication latency for these caches. To avoid cross-region replication issues, this PR introduces a region-specific client for AzureManagedMachinePools by transforming the client baseURI from something like https://management.azure.com to https://{region}.management.azure.com, so that requests are directed to the region where the VMSSes should be.
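
As a rough illustration of the base URI transformation described above, here is a minimal, standalone sketch. The function name `regionalBaseURI` and its error messages are assumptions made for this example; it is not the code in this PR, only the same idea expressed with the standard `net/url` and `path` packages.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// regionalBaseURI is a hypothetical helper (not the PR's actual code) that
// prefixes the host of a management base URI with a region name.
func regionalBaseURI(baseURI, region string) (string, error) {
	if region == "" {
		// No region given: fall back to the global endpoint unchanged.
		return baseURI, nil
	}
	u, err := url.Parse(baseURI)
	if err != nil {
		return "", fmt.Errorf("failed to parse the base URI of client: %w", err)
	}
	if u.Host == "" {
		return "", fmt.Errorf("base URI %q has no host", baseURI)
	}
	// Join host and path without the scheme, because path.Join would
	// collapse the "//" that follows "https:".
	sansScheme := path.Join(fmt.Sprintf("%s.%s", region, u.Host), u.Path)
	return fmt.Sprintf("%s://%s", u.Scheme, sansScheme), nil
}

func main() {
	uri, err := regionalBaseURI("https://management.azure.com", "westus2")
	if err != nil {
		panic(err)
	}
	fmt.Println(uri) // prints https://westus2.management.azure.com
}
```

Returning the input unchanged when the region is empty is a design choice of this sketch, so that callers without a known location keep using the global endpoint.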

fixes #1720

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

use a region specific Azure client for listing VMSS in AzureManagedClusters

@k8s-ci-robot added the release-note, kind/bug, and cncf-cla: yes labels Nov 10, 2021
@k8s-ci-robot added the area/provider/azure, sig/cluster-lifecycle, and size/L labels Nov 10, 2021
@devigned
Contributor Author

devigned commented Nov 10, 2021

success 1, 20mins

/test pull-cluster-api-provider-azure-e2e-exp

@devigned
Contributor Author

success 2, 22mins

/test pull-cluster-api-provider-azure-e2e-exp

@devigned
Contributor Author

success 3, 23mins

/test pull-cluster-api-provider-azure-e2e-exp

if err != nil {
	return "", errors.Wrap(err, "failed to parse the base URI of client")
}

Contributor

Nice

@devigned
Contributor Author

success 4, 23mins

/test pull-cluster-api-provider-azure-e2e-exp

@alexeldeib
Contributor

23 min is right around what I’d expect for the total duration of this test. Looking good!

@alexeldeib
Contributor

alexeldeib commented Nov 11, 2021

Actually, with the latest bits to reconcile pools all together on cluster create, I wonder if we can’t squeeze that lower. Delete is kind of slow though, so we’ll see what makes sense (unrelated to this PR)

@devigned
Contributor Author

5 passes, 5th 23mins

@devigned devigned force-pushed the regional-vmss-client branch 3 times, most recently from c87eec2 to d47efd7 Compare November 11, 2021 13:47
@devigned
Contributor Author

@CecileRobertMichon the PR history is now solid green for the exp job. I think we are solid. So solid that we may want to consider adding the AKS related tests to the regular pr e2e test.

@devigned
Contributor Author

> Actually, with the latest bits to reconcile pools all together on cluster create, I wonder if we can’t squeeze that lower. Delete is kind of slow though, so we’ll see what makes sense (unrelated to this PR)

With the transient error / requeue handling that I added to the AzureManagedMachinePool reconciler, I think we are close to as good as we can get. The initial reconcile of an AMMP only starts when the control plane transitions to ready. Once ready, the reconciler will persistently try to reach the goal state, requeueing whenever it hits what is deemed a transient error, such as a 404 for the VMSS or other similar errors that will be resolved as infra comes online. I don't think this is the most elegant solution, but it seems to be resilient.
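
To make the requeue-on-transient-error idea above concrete, here is a small standalone sketch. Everything in it is hypothetical and chosen for illustration: the `result` type, the `errNotFound` and `errFatal` sentinels, and the 20 second delay are assumptions, not the reconciler code from this repository.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for a 404 from the VMSS API (hypothetical sentinel).
var errNotFound = errors.New("vmss not found (404)")

// errFatal stands in for an error that will not resolve on its own (hypothetical).
var errFatal = errors.New("invalid configuration")

// result mimics the shape of a controller result: a zero RequeueAfter means done.
type result struct {
	RequeueAfter time.Duration
}

// isTransient reports whether err is expected to clear as infra comes online.
func isTransient(err error) bool {
	return errors.Is(err, errNotFound)
}

// reconcile runs one attempt; a transient error becomes a requeue, not a failure.
func reconcile(attempt func() error) (result, error) {
	err := attempt()
	switch {
	case err == nil:
		return result{}, nil
	case isTransient(err):
		// The delay is an arbitrary value picked for this sketch.
		return result{RequeueAfter: 20 * time.Second}, nil
	default:
		return result{}, fmt.Errorf("reconcile failed: %w", err)
	}
}

func main() {
	calls := 0
	// Simulate a VMSS that only becomes visible on the third attempt.
	attempt := func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("listing scale sets: %w", errNotFound)
		}
		return nil
	}
	for {
		res, err := reconcile(attempt)
		if err != nil {
			panic(err)
		}
		if res.RequeueAfter == 0 {
			break
		}
		fmt.Printf("transient error, would requeue after %s\n", res.RequeueAfter)
	}
	fmt.Printf("reached goal state after %d attempts\n", calls)
}
```

The loop in `main` only simulates a work queue; a real controller would return the result and let the framework schedule the next attempt.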

@devigned
Contributor Author

success 7?, 22mins.

	return a.aliasAuth.BaseURI()
}

sansScheme := path.Join(fmt.Sprintf("%s.%s", a.Region, a.parsedURL.Host), a.parsedURL.Path)
Contributor

Do we actually want path.Join() (OS file path separator) here, versus a hardcoded slash? Wouldn’t this now be platform dependent?

Contributor

Nvm, I might be thinking of filepath.Join()
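
For anyone following the question above: Go's `path` package always joins with forward slashes, while `path/filepath` uses the operating system's separator (a backslash on Windows). The short sketch below shows the `path.Join` behavior; the helper name `joinHostPath` is made up for this example. It also shows why the scheme has to be stripped before joining, since `path.Join` cleans its result and collapses the `//` that follows `https:`.

```go
package main

import (
	"fmt"
	"path"
)

// joinHostPath is a hypothetical wrapper around path.Join, which always
// uses "/" as the separator regardless of the operating system.
func joinHostPath(host, p string) string {
	return path.Join(host, p)
}

func main() {
	// Host and path joined with a single forward slash on every platform.
	fmt.Println(joinHostPath("westus2.management.azure.com", "/subscriptions"))
	// prints westus2.management.azure.com/subscriptions

	// Joining a full URL is lossy: the cleaned result has only one slash
	// after the scheme, so the scheme must be handled separately.
	fmt.Println(joinHostPath("https://management.azure.com", "subscriptions"))
	// prints https:/management.azure.com/subscriptions
}
```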

@alexeldeib
Contributor

Before the PR which reconciled all AMMP with cluster creation, we would create the cluster, wait for it to reconcile fully, then trigger creation of nodepools which takes ~5 min. With that approach I would expect to see around 25min runs based on my previous testing.

Since we don’t need that extra 5min anymore (all pools should be created), I wonder if we have room to squeeze the times lower somewhere, or if that’s just as good as it gets.

@alexeldeib
Contributor

Nice work with the URL scheme/host/path parsing

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 11, 2021
@devigned
Contributor Author

> Before the PR which reconciled all AMMP with cluster creation, we would create the cluster, wait for it to reconcile fully, then trigger creation of nodepools which takes ~5 min. With that approach I would expect to see around 25min runs based on my previous testing.
>
> Since we don’t need that extra 5min anymore (all pools should be created), I wonder if we have room to squeeze the times lower somewhere, or if that’s just as good as it gets.

I like where this is going! I think we need to do some tinkering now that tests are more stable.

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

looks great, just one comment about using the Location() func

exp/controllers/azuremanagedmachinepool_reconciler.go (outdated, resolved)
return &azureManagedMachinePoolService{
scope: scope,
agentPoolsSvc: agentpools.New(scope),
scaleSetsSvc: scalesets.NewClient(scope),
}
scaleSetsSvc: scalesets.NewClient(authorizer),
Contributor

nice

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2021
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2021
@k8s-ci-robot k8s-ci-robot merged commit feba723 into kubernetes-sigs:main Nov 12, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.1 milestone Nov 12, 2021
@devigned devigned deleted the regional-vmss-client branch November 13, 2021 16:17
@CecileRobertMichon
Contributor

/cherry-pick release-1.0

@k8s-infra-cherrypick-robot

@CecileRobertMichon: #1850 failed to apply on top of branch "release-1.0":

Applying: use a regional base uri for making VMSS requests
Using index info to reconstruct a base tree...
M	exp/controllers/azuremanagedmachinepool_controller.go
M	exp/controllers/azuremanagedmachinepool_reconciler.go
M	templates/test/ci/cluster-template-prow-aks-multi-tenancy.yaml
M	templates/test/ci/prow-aks-multi-tenancy/kustomization.yaml
A	templates/test/ci/prow-aks-multi-tenancy/patch_location.yaml
Falling back to patching base and 3-way merge...
Auto-merging exp/controllers/azuremanagedmachinepool_reconciler.go
Auto-merging exp/controllers/azuremanagedmachinepool_controller.go
CONFLICT (content): Merge conflict in exp/controllers/azuremanagedmachinepool_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 use a regional base uri for making VMSS requests
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon
Contributor

Ok that's fair @ cherry-pick bot :)

@devigned do you think we should cherry-pick this to release-1.0?

@devigned
Contributor Author

devigned commented Nov 19, 2021

> @devigned do you think we should cherry-pick this to release-1.0?

With the next release coming very soon and managed clusters being experimental, I lean toward not dealing with the cherry pick merge conflict.

@jackfrancis
Contributor

/cherry-pick release-0.5

@k8s-infra-cherrypick-robot

@jackfrancis: #1850 failed to apply on top of branch "release-0.5":

Applying: use a regional base uri for making VMSS requests
Using index info to reconstruct a base tree...
M	exp/controllers/azuremanagedmachinepool_controller.go
M	exp/controllers/azuremanagedmachinepool_reconciler.go
M	templates/test/ci/cluster-template-prow-aks-multi-tenancy.yaml
M	templates/test/ci/prow-aks-multi-tenancy/kustomization.yaml
A	templates/test/ci/prow-aks-multi-tenancy/patch_location.yaml
Falling back to patching base and 3-way merge...
Auto-merging templates/test/ci/cluster-template-prow-aks-multi-tenancy.yaml
Auto-merging exp/controllers/azuremanagedmachinepool_reconciler.go
Auto-merging exp/controllers/azuremanagedmachinepool_controller.go
CONFLICT (content): Merge conflict in exp/controllers/azuremanagedmachinepool_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 use a regional base uri for making VMSS requests
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

AKS test is flaky
6 participants