docs: added documentation for usage of failure domains #3173

Ankitasw · 2022-02-08T15:32:10Z

What type of PR is this?
/kind documentation

What this PR does / why we need it:
This PR adds documentation on using failure domains in CAPA for control plane and worker nodes.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2924

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

Added documentation for usage of failure domains in control planes and worker nodes

k8s-ci-robot · 2022-02-08T15:32:18Z

@Ankitasw: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sedefsavas · 2022-02-08T23:02:25Z

I think we can combine this with control plane AZ support:
https://cluster-api-aws.sigs.k8s.io/topics/multi-az-control-planes.html
Maybe under a single section Failure Domains, with 2 subsections workers and control-plane. WDYT?

Also, this needs to be added to SUMMARY_PREFIX.md to be added to the book index. You can check locally how changes look by running make serve-book.

Ankitasw · 2022-02-09T05:28:19Z

> I think we can combine this with control plane AZ support:
> https://cluster-api-aws.sigs.k8s.io/topics/multi-az-control-planes.html
> Maybe under a single section Failure Domains, with 2 subsections workers and control-plane. WDYT?

So do we want to cover single-AZ and multi-AZ support for control plane and worker nodes in the same doc, i.e., Failure Domains?

sedefsavas · 2022-02-09T05:32:21Z

Not the same file, but the same section would be better; otherwise the file will be too long. My comment above:

> Maybe under a single section Failure Domains, with 2 subsections workers and control-plane. WDYT?

Ankitasw · 2022-02-09T13:02:32Z

> Not the same file but section would be better, otherwise will be too long.

@sedefsavas Arranged the sections, please have a look.

enxebre · 2022-02-09T13:12:11Z

docs/book/src/topics/failure-domains/worker-nodes.md

@@ -0,0 +1,132 @@
+# Failure domains in worker nodes
+
+To ensure that the worker machines are spread across failure domains, we need to create N `MachineDeployment` for your N failure domains, scaling them independently. Resiliency to failures comes from having multiple `MachineDeployment`.
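
The sentence in the diff above can be illustrated with a small sketch that builds one `MachineDeployment` manifest per failure domain. This is only an illustration, not part of the PR: the helper name `machine_deployments`, the cluster name, the availability zone names, and the `cluster.x-k8s.io/v1beta1` API version are assumptions, and required fields such as the bootstrap and infrastructure references are omitted for brevity.

```python
def machine_deployments(cluster_name, failure_domains, replicas_per_domain=1):
    """Build one MachineDeployment manifest (as a dict) per failure domain.

    Hypothetical helper for illustration; the manifests are deliberately
    incomplete (no bootstrap or infrastructure references).
    """
    manifests = []
    for fd in failure_domains:
        manifests.append({
            "apiVersion": "cluster.x-k8s.io/v1beta1",  # assumed API version
            "kind": "MachineDeployment",
            "metadata": {"name": f"{cluster_name}-md-{fd}"},
            "spec": {
                "clusterName": cluster_name,
                # Each deployment is scaled independently of the others.
                "replicas": replicas_per_domain,
                "template": {
                    "spec": {
                        "clusterName": cluster_name,
                        # Pins every machine of this deployment to one domain.
                        "failureDomain": fd,
                    }
                },
            },
        })
    return manifests


# Example: three availability zones (assumed names) yield three deployments.
mds = machine_deployments("demo", ["us-east-1a", "us-east-1b", "us-east-1c"])
```

Because each deployment is pinned to a single failure domain, losing one domain affects only the machines of that deployment.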


Do we want to elaborate here on how .failureDomain and .Subnet relate?
E.g., we have two sources for subnets:

- If subnet.id or subnet.filters are specified, we query AWS directly.
- In all other cases, we use the subnets provided in the cluster network spec without ever calling AWS.

Relates to #2864
cc @codablock
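
The two subnet sources described in this comment can be modelled as a small decision function. This is a minimal sketch of the rule as stated here, not CAPA's actual Go implementation: `SubnetRef`, `subnet_source`, and the returned labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubnetRef:
    """Hypothetical stand-in for the subnet reference on a machine spec."""
    id: Optional[str] = None
    filters: List[dict] = field(default_factory=list)


def subnet_source(subnet: Optional[SubnetRef]) -> str:
    """Return which source would supply the subnet under the stated rule."""
    # An explicit ID or any filter means AWS is queried directly.
    if subnet is not None and (subnet.id or subnet.filters):
        return "aws-query"
    # Otherwise the subnets from the cluster network spec are used,
    # without calling AWS.
    return "cluster-network-spec"
```

For example, `subnet_source(SubnetRef(id="subnet-123"))` selects the AWS query path, while `subnet_source(None)` falls back to the cluster network spec.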

Ankitasw · 2022-02-09

This is already covered in the control plane doc. Do we want to add a similar explanation for worker nodes as well?

enxebre · 2022-02-09

Ah, thanks @Ankitasw, I had missed that. +1 to presenting it in a way that makes clear it applies to both control plane and worker nodes.

Ankitasw · 2022-02-09

Sure @enxebre, I will summarize it for both control plane and worker nodes.

Ankitasw · 2022-02-09

Added a note here so that it is clear it applies to worker nodes as well.

richardcase · 2022-02-11T10:59:59Z

/lgtm

sedefsavas · 2022-02-14T08:21:26Z

/approve

k8s-ci-robot · 2022-02-14T08:21:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sedefsavas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sedefsavas]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from sedefsavas and shivi28 February 8, 2022 15:32

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 8, 2022

Ankitasw changed the title ~~docs: added documentation for usage of failure domains~~ [WIP] docs: added documentation for usage of failure domains Feb 9, 2022

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2022

Ankitasw force-pushed the failure-domain-doc branch from 01b77c7 to 373ecc2 Compare February 9, 2022 12:54

Ankitasw changed the title ~~[WIP] docs: added documentation for usage of failure domains~~ docs: added documentation for usage of failure domains Feb 9, 2022

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2022

Ankitasw force-pushed the failure-domain-doc branch 4 times, most recently from f26fe8d to 6200cb4 Compare February 9, 2022 13:00

enxebre reviewed Feb 9, 2022

Ankitasw force-pushed the failure-domain-doc branch from 6200cb4 to 61fdf9f Compare February 9, 2022 15:07

Add documentation for usage of failure domains

92e7bcb

Ankitasw force-pushed the failure-domain-doc branch from 61fdf9f to 92e7bcb Compare February 9, 2022 15:20

Ankitasw requested a review from enxebre February 10, 2022 10:24

k8s-ci-robot assigned richardcase Feb 11, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2022

k8s-ci-robot merged commit 3666c9f into kubernetes-sigs:main Feb 14, 2022

k8s-ci-robot added this to the v1.x milestone Feb 14, 2022
