Implement drop-at-a-distance semantics. #734

DimCitus · 2021-06-18T17:34:07Z

This allows running `pg_autoctl drop node --name foo` from the monitor, or
even using the SQL API directly, and having the node realise it has been
dropped, and then stop.

The configuration file, state file, and PGDATA are not touched. To remove
them, run `pg_autoctl drop node --pgdata ... --destroy` on the node
itself.

Fixes #690

JelteF

Issues found:

- `pg_autoctl drop node --destroy` is indeed broken if run without first running `pg_autoctl drop node`.
- In a two-node cluster, `pg_autoctl drop node` results in no dropped node and two single nodes.
- We still need a way to remove a node from the monitor, even if the node is completely down. Maybe add a flag like `--force`.

src/bin/pg_autoctl/service_keeper.c

src/monitor/pgautofailover.sql

JelteF

(wrong button before)

JelteF · 2021-06-23T15:02:25Z

Problems found:

1. wait_standby -> single transition is broken (triggered by dropping the primary in a two-node cluster when the other node is in wait_standby)
2. wait_standby -> dropped transition is broken

JelteF · 2021-06-23T15:12:34Z

When running drop on a postgres server (not the monitor), nothing happens for 30 seconds.
sync_state and WAL LSN are empty in the log

src/bin/pg_autoctl/fsm.c

DimCitus · 2021-06-23T16:06:58Z

> When running drop on a postgres server (not the monitor), nothing happens for 30 seconds.

This is fixed now.

> sync_state and WAL LSN are empty in the log

Yes, because Postgres is not running, but it shouldn't prevent us from dropping the node. That's fixed now too.

JelteF

I still get this log with empty values when I do the following (everything still works, but the log looks confusing):

1. stop node
2. drop node from monitor
3. pg_autoctl run or "drop node locally again" (drop node --pgdata node2)

10:34:52 29061 INFO  Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0.

src/bin/pg_autoctl/keeper_pg_init.c

DimCitus · 2021-06-24T08:49:02Z

> 10:34:52 29061 INFO  Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0.

Indeed, keeper_fsm_step didn't get the memo about this log output being a log_debug nowadays. Fixed, thanks!

JelteF

These two problems are still present:

1. wait_standby -> single transition is broken (triggered by dropping primary in two node cluster, when other node is wait_standby)
2. wait_standby -> dropped transition is broken

I think it's fine to not fix issue 1, since it seems very rare. But issue 2 should be fixed, since it should be possible to drop a node that's currently in wait_standby mode.

src/bin/pg_autoctl/cli_drop_node.c

src/monitor/pgautofailover--1.5--1.6.sql

src/monitor/pgautofailover.sql

src/bin/pg_autoctl/monitor.c

src/bin/pg_autoctl/keeper.c

DimCitus · 2021-06-24T12:42:28Z

> wait_standby -> dropped transition is broken

Fixed now.

JelteF · 2021-06-24T14:46:39Z

> Fixed now.

Dropping a node while creating it still gives me errors. Now this:

src/monitor/pgautofailover--1.5--1.6.sql

DimCitus · 2021-06-24T15:07:47Z

> Dropping a node while creating it still gives me errors. Now this:

You need to be quite fast to get that one. Then when trying I was even faster apparently and could drop the node in the step before that. Now fixed.

This allows running `pg_autoctl drop node --name foo` from the monitor, or even using the SQL API directly, and having the node realise it has been dropped, and then stop.

The configuration file, state file, and PGDATA are not touched. To remove them, run `pg_autoctl drop node --pgdata ... --destroy` on the node itself.

TODO: review `pg_autoctl drop node --destroy`, probably broken?

- fix pg_autoctl drop node for a local Postgres (keeper) node
- fix removing a node in a cluster of two nodes

When a node won't come back up again, the first call to `pg_autoctl drop node` sets the assigned role to DROPPED, but the node is not in a position to call node_active() to clean up this entry. Now, another call to `pg_autoctl drop node` from the monitor makes it possible to clean up the node for real. If the node does come back up after all, the situation is properly detected and the command `pg_autoctl create node --run` is now able to re-register the node with its old nodeid and continue from there.
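As a rough illustration of the handshake described in this paragraph, here is a minimal in-memory sketch. It is not the pg_auto_failover implementation: `ToyMonitor`, `keeper_step`, the dictionary layout, and the `"assigned"`/`"removed"` return values are all hypothetical names made up for the example. It follows the paragraph literally (a second drop call cleans up the entry) and does not model the `--force` flag or re-registration with the old nodeid.

```python
DROPPED = "dropped"


class ToyMonitor:
    """Hypothetical stand-in for the monitor's node table (illustration only)."""

    def __init__(self):
        # node name -> {"reported": state, "goal": state}
        self.nodes = {}

    def register(self, name, state="secondary"):
        self.nodes[name] = {"reported": state, "goal": state}

    def drop_node(self, name):
        """First call assigns DROPPED; a second call removes the entry."""
        node = self.nodes[name]
        if node["goal"] == DROPPED:
            # The node never acknowledged the drop: clean up for real.
            del self.nodes[name]
            return "removed"
        node["goal"] = DROPPED
        return "assigned"

    def node_active(self, name, reported):
        """Periodic call from a running node; returns its goal state, or None if unknown."""
        node = self.nodes.get(name)
        if node is None:
            return None
        node["reported"] = reported
        if reported == DROPPED and node["goal"] == DROPPED:
            # The node acknowledged the drop, so its entry can go away.
            del self.nodes[name]
            return DROPPED
        return node["goal"]


def keeper_step(monitor, name, current_state):
    """One loop iteration on the node; returns (new_state, keep_running)."""
    goal = monitor.node_active(name, current_state)
    if goal == DROPPED:
        # Acknowledge the drop, then stop. Nothing on disk is touched here.
        monitor.node_active(name, DROPPED)
        return DROPPED, False
    return (goal or current_state), True
```

In this sketch a running node notices the drop on its next `keeper_step` and stops, while a node that never calls back is only removed by the second `drop_node` call.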

In particular, dropping a local node that is running in the background was broken. Refactor and simplify the code to make all cases work:

1. stop local node and drop
2. drop local node while it's running in the background
3. drop node from the monitor while it's running
4. drop node from the monitor after having stopped it

In case 4 we then need to resort to `pg_autoctl drop node --force`, as expected.
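The four cases above can be written down as a small decision table. This is only a sketch of how the cases are read here, assuming that cases 1 to 3 need no `--force`; the names `needs_force`, `CASES`, and `force_table` are hypothetical and do not exist in pg_autoctl.

```python
def needs_force(issued_from: str, node_running: bool) -> bool:
    """Return True when, per the cases above, the drop needs --force.

    issued_from is "node" (drop run locally) or "monitor" (drop run remotely).
    """
    if issued_from not in ("node", "monitor"):
        raise ValueError(f"unknown origin: {issued_from!r}")
    # Only a drop issued from the monitor for a stopped node cannot be
    # acknowledged by the node itself.
    return issued_from == "monitor" and not node_running


# Case number -> (where the drop is issued, whether the node is running)
CASES = {
    1: ("node", False),     # stop local node and drop
    2: ("node", True),      # drop local node running in the background
    3: ("monitor", True),   # drop from the monitor while it's running
    4: ("monitor", False),  # drop from the monitor after stopping it
}


def force_table():
    """Map each case number to whether --force is needed."""
    return {case: needs_force(*args) for case, args in CASES.items()}
```

Calling `force_table()` gives `True` only for case 4, which matches the note that a stopped node dropped from the monitor cannot acknowledge the drop on its own.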

JelteF · 2021-06-24T15:39:27Z

If you're even faster you get this:

To reproduce:

1. Press Ctrl+C in the standby right when it is shown in the status window
2. Drop on the monitor
3. pg_autoctl run the standby again

.travis.yml

JelteF

Feel free to fix the last very quick drop issue I provided, but I don't think it's super necessary.

JelteF

Found another bug that I think should be fixed:

To reproduce:

1. Two-node cluster, wait until stable
2. Drop node2
3. Drop node1
4. Recreate node1

DimCitus · 2021-06-24T16:06:18Z

> 1. Two-node cluster, wait until stable
> 2. Drop node2
> 3. Drop node1
> 4. Recreate node1

Fixed now.

JelteF

LGTM 🎉

DimCitus added the enhancement (New feature or request), user experience, and Size:M (Effort Estimate: Medium) labels Jun 18, 2021

DimCitus added this to the Sprint 2021 W24 W25 milestone Jun 18, 2021

DimCitus requested a review from JelteF June 18, 2021 17:34

DimCitus self-assigned this Jun 18, 2021

JelteF approved these changes Jun 21, 2021

src/bin/pg_autoctl/service_keeper.c (outdated)

src/monitor/pgautofailover.sql

JelteF suggested changes Jun 21, 2021

DimCitus force-pushed the feature/drop-state branch from 73600bb to 329c31e Compare June 21, 2021 19:05

DimCitus added Size:L (Effort Estimate: Large) and removed Size:M (Effort Estimate: Medium) labels Jun 22, 2021

DimCitus requested a review from JelteF June 22, 2021 16:23

JelteF reviewed Jun 23, 2021

src/bin/pg_autoctl/fsm.c

JelteF suggested changes Jun 24, 2021

src/bin/pg_autoctl/keeper_pg_init.c (outdated)

JelteF suggested changes Jun 24, 2021

DimCitus requested a review from JelteF June 24, 2021 12:42

DimCitus added Size: XL (Effort Estimate: eXtra Large) and removed Size:L (Effort Estimate: Large) labels Jun 24, 2021

JelteF suggested changes Jun 24, 2021

src/monitor/pgautofailover--1.5--1.6.sql (outdated)

DimCitus added 4 commits June 24, 2021 17:11

Fix pg_autoctl drop node with the new DROPPED state.

8bd59e7

Assorted fixes.

0988dcd

- fix pg_autoctl drop node for a local Postgres (keeper) node
- fix removing a node in a cluster of two nodes

Fix pg_autoctl drop monitor [ --destroy ]

f4e7782

DimCitus added 18 commits June 24, 2021 17:11

Allow pg_autoctl create node on top of a dropped node.

739071b

Update docs, fix another bug.

e212986

Fix make installcheck.

7bca12f

Clean-up a compiler warning from Linux builds.

4fd7575

Fix the case when the monitor is disabled.

9d10159

Add the DROPPED state in the FSM documentation.

946f215

Per review, introduce pg_autoctl drop node --force.

ccb08ff

Travis: work around shutil.which("pg_autoctl") returning None sometimes.

4c51582

Per review.

279fa01

Fix the Travis work-around for shutil.which() failures.

988924d

Per review.

6e5e283

Review and fix/finish the extension upgrade script to 1.6.

8073fa1

Per review, fix dropping a node during wait_standby to catchingup.

9b61091

Allow Postgres 14 failures.

e7b2905

Improve handling of node being drop during init phases, per review.

6d2ffee

Fix extension update script (no CASCADE in there).

f9c63be

DimCitus force-pushed the feature/drop-state branch from 85c6eb7 to f9c63be Compare June 24, 2021 15:11

DimCitus requested a review from JelteF June 24, 2021 15:13

JelteF reviewed Jun 24, 2021

.travis.yml

JelteF approved these changes Jun 24, 2021

JelteF suggested changes Jun 24, 2021

Per review, allow re-creating a primary node from a dropped node.

b322f92

JelteF approved these changes Jun 24, 2021

DimCitus merged commit 99ae2b8 into master Jun 24, 2021

DimCitus deleted the feature/drop-state branch June 24, 2021 16:54

DimCitus mentioned this pull request Jul 7, 2021

Restart logic counts intended restarts as bad exit #327

Open
