Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for cluster-level coordination setting changes #6336

Merged
merged 14 commits into from
Feb 2, 2024

Conversation

niyatiagg
Copy link
Contributor

Description

Adds documentation for the Leader check timeout and follower check timeout to be set dynamically through cluster update API

Issues Resolved

Closes #5603

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Niyati Aggarwal <niyatiagg4641@gmail.com>
@hdhalter hdhalter added 4 - Doc review PR: Doc review in progress release-notes PR: Include this PR in the automated release notes v2.12.0 labels Feb 1, 2024
@hdhalter hdhalter changed the title Adding documentation for cluster-level coordination setting changes Add documentation for cluster-level coordination setting changes Feb 1, 2024
Copy link
Collaborator

@kolchfa-aws kolchfa-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @niyatiagg! A couple of changes and then we'll move the PR to editorial review.

niyatiagg and others added 4 commits February 1, 2024 12:48
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Niyati Aggarwal <121826855+niyatiagg@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Niyati Aggarwal <121826855+niyatiagg@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Niyati Aggarwal <121826855+niyatiagg@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Niyati Aggarwal <121826855+niyatiagg@users.noreply.github.com>
@niyatiagg
Copy link
Contributor Author

Thank you, @niyatiagg! A couple of changes and then we'll move the PR to editorial review.

Thank you for the suggestions!

Copy link
Member

@owaiskazi19 owaiskazi19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @niyatiagg for the documentation. LGTM overall with few suggestions.
@jainankitk can you take another pass? And also do you think we should include a note saying changing the settings can cause the cluster to become unstable?

OpenSearch supports the following cluster-level coordination settings. All settings in this list are dynamic:

- `cluster.fault_detection.leader_check.timeout` (Time unit): Determines the amount of time a node waits for a response from the cluster manager during a leader check before deeming the check as a failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also mention the default timeout here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let us mention the default and min/max values

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niyatiagg you can include Changing this setting from the default value may cause an unstable cluster. as the last sentence here.

- `cluster.fault_detection.leader_check.timeout` (Time unit): Determines the amount of time a node waits for a response from the cluster manager during a leader check before deeming the check as a failure.

- `cluster.fault_detection.follower_check.timeout` (Time unit): Determines the amount of time the cluster manager waits for a response during a follower check before deeming the check as a failure.
Copy link
Member

@owaiskazi19 owaiskazi19 Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a default timeout here as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let us mention the default and min/max values

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well Changing this setting from the default value may cause an unstable cluster. last sentence.

OpenSearch supports the following cluster-level coordination settings. All settings in this list are dynamic:

- `cluster.fault_detection.leader_check.timeout` (Time unit): Determines the amount of time a node waits for a response from the cluster manager during a leader check before deeming the check as a failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `cluster.fault_detection.leader_check.timeout` (Time unit): Determines the amount of time a node waits for a response from the cluster manager during a leader check before deeming the check as a failure.
- `cluster.fault_detection.leader_check.timeout` (Time unit): Determines the amount of time a node waits for a response from the elected cluster manager during a leader check before deeming the check as a failure.

- `cluster.fault_detection.leader_check.timeout` (Time unit): Determines the amount of time a node waits for a response from the cluster manager during a leader check before deeming the check as a failure.

- `cluster.fault_detection.follower_check.timeout` (Time unit): Determines the amount of time the cluster manager waits for a response during a follower check before deeming the check as a failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `cluster.fault_detection.follower_check.timeout` (Time unit): Determines the amount of time the cluster manager waits for a response during a follower check before deeming the check as a failure.
- `cluster.fault_detection.follower_check.timeout` (Time unit): Determines the amount of time the elected cluster manager waits for a response during a follower check before deeming the check as a failure.

@jainankitk
Copy link
Contributor

Thanks @niyatiagg for the documentation. LGTM overall with few suggestions. @jainankitk can you take another pass?

Agree with specifying default value.

And also do you think we should include a note saying changing the settings can cause the cluster to become unstable?

Yeah, that can prevent customers from tweaking this unknowingly. Changing this setting from the default value can result into unstable cluster.

Signed-off-by: Niyati Aggarwal <niyatiagg4641@gmail.com>
Signed-off-by: Niyati Aggarwal <niyatiagg4641@gmail.com>
Signed-off-by: Niyati Aggarwal <niyatiagg4641@gmail.com>
Signed-off-by: Niyati Aggarwal <niyatiagg4641@gmail.com>
Copy link
Contributor

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Copy link
Collaborator

@natebower natebower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolchfa-aws Just a couple of words to remove. Thanks!

kolchfa-aws and others added 2 commits February 2, 2024 11:20
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws kolchfa-aws merged commit 3a01797 into opensearch-project:main Feb 2, 2024
3 checks passed
@hdhalter hdhalter added 3 - Done Issue is done/complete and removed 4 - Doc review PR: Doc review in progress labels Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Done Issue is done/complete release-notes PR: Include this PR in the automated release notes v2.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] Update documentation for cluster settings to include leader/checker timeout setting
7 participants