@DimCitus (Collaborator) commented Feb 15, 2021

This allows for adjusting to a lost monitor while keeping our services
online. At run-time, we can now

$ pg_autoctl disable monitor
$ pg_autoctl enable monitor --monitor ...

Provided the order of operations is respected, the cluster continues to
behave correctly with a minimum of disturbance.

@DimCitus DimCitus added the Size:M Effort Estimate: Medium label Feb 15, 2021
@DimCitus DimCitus added this to the Sprint 2021 W6 W7 milestone Feb 15, 2021
@DimCitus DimCitus requested a review from JelteF February 15, 2021 17:10
@DimCitus DimCitus self-assigned this Feb 15, 2021
}

NodeState initialState =
pgSetup->is_in_recovery ? WAIT_STANDBY_STATE : SINGLE_STATE;


There might be a primary and a demoted primary when the monitor is enabled. How does the initialState affect things?

@DimCitus (Collaborator, Author) replied:

You are pointing exactly at why we don't have a fully automated way to register nodes to a new monitor. To work around the problem, we ask the user to be careful about the node registration ordering. Once a primary has been registered, handling the other (demoted) primary is the more interesting case to support.

I suppose in this case we are back to the previous situation, where the current state needs to be dropped and a new node created from scratch:

$ pg_autoctl stop
$ pg_autoctl drop node --pgdata ...
$ pg_autoctl create postgres --pgdata ... ...

But that's only when you're in the unfortunate position of losing the monitor node just after having lost a primary node, and you now have all the nodes back online and are picking up the pieces.

A Contributor replied:

I think that to avoid this issue we should not use is_in_recovery to determine in which state the node should be registered. It makes more sense to base this on the current state of the node:

  1. if the state is primary, wait_primary, or single (and maybe draining), we should try to join as single;
  2. in any other state, we should try to join as wait_standby.

@DimCitus (Collaborator, Author) replied:

I think there is a scenario where it's better to use is_in_recovery: when deploying to two regions for disaster recovery purposes, and the first (primary) region has just been lost, we want to promote the only node left in the DR region. It's possible to do so with:

  1. pg_autoctl disable monitor
  2. pg_ctl promote (or call the SQL function)
  3. start a new monitor
  4. pg_autoctl enable monitor

At this point the on-disk FSM state is not relevant anymore. One could argue for using pg_autoctl do fsm assign single in step 2 instead, though that command is not documented at the moment and requires PG_AUTOCTL_DEBUG=1 in the environment.
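The four steps above might look like this on the surviving DR node. This is a sketch, not a tested runbook: the --pgdata path and the monitor connection string are placeholders, and the new monitor is assumed to have been created on another host with pg_autoctl create monitor.

```shell
$ pg_autoctl disable monitor --pgdata /var/lib/postgres/node1
$ pg_ctl promote -D /var/lib/postgres/node1
$ # ... create and start a new monitor on another host, then:
$ pg_autoctl enable monitor --pgdata /var/lib/postgres/node1 \
      --monitor 'postgres://autoctl_node@new-monitor/pg_auto_failover'
```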


DimCitus added 11 commits March 18, 2021 16:18
Reduce each matrix job to a smaller work unit; it seems Travis is having
trouble again with too many things in a single work unit.
@DimCitus DimCitus force-pushed the feature/reset-state branch from 8eca60c to a869445 Compare March 18, 2021 15:18
@DimCitus DimCitus merged commit 78132d1 into master Mar 22, 2021
@DimCitus DimCitus deleted the feature/reset-state branch March 22, 2021 17:09