Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Azure machine pool rolling upgrades with MaxSurge, MaxUnavailable and DeletePolicy #1332

Merged
merged 5 commits into from Jun 9, 2021

Conversation

devigned
Copy link
Contributor

@devigned devigned commented Apr 22, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR introduces AzureMachinePoolMachine resource to replace AzureMachinePool.Status.Instances in accordance to Azure Machine Pool Machines Proposal.

This also introduces the machine pool rolling upgrade strategy which provides, MaxSurge, MaxUnavailable, and DeletePolicy modeled after machine deployments.

Functional changes:

  • Ability to delete individual AzureMachinePoolMachines
  • MaxSurge to surge machine pools upon upgrade of the machine pool template
  • MaxUnavailable to ensure a minimum number of machines are available at any time
  • DeletePolicy to enable preference on which nodes to delete, Oldest, Newest, Random
  • Fully async reconciliation of AzureMachinePool resources

Non-goals of the PR (follow up PRs):

  • implement cordon and drain per Azure machine pool machine
  • implement instance update rather than delete / re-create

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
related #1105
fixes #1361

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

add Azure machine pool rolling upgrades with MaxSurge, MaxUnavailable and DeletePolicy. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 22, 2021
@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Apr 22, 2021
@devigned devigned changed the title add AzureMachinePoolMachine resource to migrate AzureMachinePool add Azure machine pool rolling upgrades with MaxSurge, MaxUnavailable and DeletePolicy Apr 22, 2021
@devigned
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2021
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing but I'm having a hard time following all the different changes... It would greatly help review if you were able to split out any unrelated improvements to other PRs and structure the changes in this PR in multiple commits.

api/v1alpha4/conditions_consts.go Outdated Show resolved Hide resolved
azure/scope/machinepool.go Outdated Show resolved Hide resolved
azure/scope/machinepool.go Outdated Show resolved Hide resolved
azure/scope/machinepool.go Outdated Show resolved Hide resolved
azure/scope/machinepool.go Outdated Show resolved Hide resolved
exp/api/v1alpha4/azuremachinepool_types.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 4, 2021
@devigned devigned force-pushed the surge-take2 branch 5 times, most recently from 6409b27 to 1c493b9 Compare May 4, 2021 15:20
@devigned devigned mentioned this pull request May 4, 2021
3 tasks
@devigned devigned force-pushed the surge-take2 branch 2 times, most recently from 4cdbbd0 to 81f945b Compare May 4, 2021 19:02
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still working through reviewing all the files... thanks for splitting up commits

pkg/coalescing/mocks/doc.go Outdated Show resolved Hide resolved
test/e2e/helpers.go Outdated Show resolved Hide resolved
test/e2e/helpers.go Outdated Show resolved Hide resolved
api/v1alpha4/conditions_consts.go Outdated Show resolved Hide resolved
azure/scope/machinepool.go Show resolved Hide resolved
}

// ProviderID returns the K8s provider ID for the VMSS instance
func (vm VMSSVM) ProviderID() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this stuff really belong in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to put them where ever. They definitely didn't belong in the ./exp/api where they were before.

I do think they belong here as they are basically the spec structs which have yet to be refactored for the exp controllers.

azure/types.go Outdated Show resolved Hide resolved
@nader-ziada
Copy link
Contributor

@nader-ziada and @CecileRobertMichon is there anything else you'd like to see changed in this PR?

lgtm

@CecileRobertMichon
Copy link
Contributor

I'll take another pass by EOD

api/v1alpha4/types.go Outdated Show resolved Hide resolved
api/v1alpha4/types.go Outdated Show resolved Hide resolved
azure/converters/vmss.go Outdated Show resolved Hide resolved
azure/converters/vmss.go Show resolved Hide resolved
azure/errors.go Outdated Show resolved Hide resolved
exp/controllers/azuremachinepoolmachine_controller.go Outdated Show resolved Hide resolved
if err = controllers.NewAzureMachineReconciler(
mgr.GetClient(),
ctrl.Log.WithName("controllers").WithName("AzureMachine"),
if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it normal that readyz check takes the same args as healthz check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I was just keeping what is already in HEAD.

if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
setupLog.Error(err, "unable to create ready check")
os.Exit(1)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/coalescing/reconciler.go Outdated Show resolved Hide resolved
test/e2e/config/azure-dev.yaml Outdated Show resolved Hide resolved
@@ -162,6 +162,12 @@ metadata:
spec:
identity: SystemAssigned
location: ${AZURE_LOCATION}
strategy:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these required changes to the templates or are they illustrative / used as examples? I don't see any docs updates or additions in this PR, what parts of the PR do you think are user-facing changes that need documentation updates?

Copy link
Contributor Author

@devigned devigned Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not required. I only added them for illustrative purposes.

The item that needs documentation is the rollout and delete policies. I'll create an issue to address rather than adding to an already enormous PR.

#1427

@CecileRobertMichon
Copy link
Contributor

test failed because of quota issues in centralus: #1428

@CecileRobertMichon
Copy link
Contributor

this fixes #819 and #1361, correct?

@CecileRobertMichon
Copy link
Contributor

/retest

@CecileRobertMichon
Copy link
Contributor

/lgtm

and, given @nader-ziada's comment above

/approve

/pony

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2021
@k8s-ci-robot
Copy link
Contributor

@CecileRobertMichon: pony image

In response to this:

/lgtm

and, given @nader-ziada's comment above

/approve

/pony

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2021
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jun 9, 2021

@devigned: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-provider-azure-apidiff f96fba5 link /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@CecileRobertMichon CecileRobertMichon linked an issue Jun 9, 2021 that may be closed by this pull request
@devigned
Copy link
Contributor Author

devigned commented Jun 9, 2021

this fixes #819 and #1361, correct?

#1361, yes. But #819, no, not until cordon and drain are added to the AzureMachinePoolMachine delete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MachinePool has more worker nodes than the desired value
4 participants