Conversation

@DimCitus
Collaborator

@DimCitus DimCitus commented Jun 18, 2021

This allows running pg_autoctl drop node --name foo from the monitor, or
even using the SQL API directly, and having the node realise it has been
dropped and then stop.

The configuration file, state file, and PGDATA are not touched. To remove
them, use the command pg_autoctl drop node --pgdata ... --destroy on the node
itself.
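
A minimal sketch of the node-side behaviour described above, written as a toy Python model rather than the actual keeper code. The function name `run_keeper`, the state labels, and the file names are assumptions made for illustration: the node stops once its assigned state is "dropped" and leaves its local files alone.

```python
from typing import Iterable, List, Set

DROPPED = "dropped"  # assumed label for the state assigned by the monitor


def run_keeper(assigned_states: Iterable[str], local_files: Set[str]) -> List[str]:
    """Consume the states a (simulated) monitor assigns on each loop
    iteration and stop as soon as the node learns it was dropped."""
    events: List[str] = []
    for assigned in assigned_states:
        if assigned == DROPPED:
            # The node realises it has been dropped and stops. Nothing in
            # local_files is removed here; that is left to --destroy.
            events.append("dropped: stopping")
            break
        events.append(f"assigned {assigned}")
    return events


# Hypothetical placeholders for the files the description says stay untouched.
files = {"configuration file", "state file", "PGDATA"}
events = run_keeper(["secondary", "secondary", "dropped", "secondary"], files)
print(events)
```

The trailing "secondary" in the input is never processed, which is the point of the sketch: once dropped, the loop ends.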

Fixes #690

@DimCitus DimCitus added the enhancement (New feature or request), user experience, and Size:M (Effort Estimate: Medium) labels Jun 18, 2021
@DimCitus DimCitus added this to the Sprint 2021 W24 W25 milestone Jun 18, 2021
@DimCitus DimCitus requested a review from JelteF June 18, 2021 17:34
@DimCitus DimCitus self-assigned this Jun 18, 2021
Contributor

@JelteF JelteF left a comment

Issues found:

  1. pg_autoctl drop node --destroy is indeed broken if done without first running pg_autoctl drop node
  2. In a two-node cluster, pg_autoctl drop node results in no dropped node and two single nodes
  3. We still need a way to remove a node from the monitor, even if the node is completely down. Maybe add a flag like --force.

Contributor

@JelteF JelteF left a comment

(wrong button before)

@DimCitus DimCitus force-pushed the feature/drop-state branch from 73600bb to 329c31e Compare June 21, 2021 19:05
@DimCitus DimCitus added the Size:L (Effort Estimate: Large) label and removed the Size:M (Effort Estimate: Medium) label Jun 22, 2021
@DimCitus DimCitus requested a review from JelteF June 22, 2021 16:23
@JelteF
Contributor

JelteF commented Jun 23, 2021

Problems found:

  1. wait_standby -> single transition is broken (triggered by dropping primary in two node cluster, when other node is wait_standby)
  2. wait_standby -> dropped transition is broken
    [image]

@JelteF
Contributor

JelteF commented Jun 23, 2021

  1. When running drop on a postgres server (not the monitor), nothing happens for 30 seconds.
  2. sync_state and WAL LSN are empty in the log
    [image]

@DimCitus
Collaborator Author

  • When running drop on a postgres server (not the monitor), nothing happens for 30 seconds.

This is fixed now.

  • sync_state and WAL LSN are empty in the log

Yes, because Postgres is not running, but it shouldn't prevent us from dropping the node. That's fixed now too.

Contributor

@JelteF JelteF left a comment

I still get this log line with empty values after the following steps (everything still works, but the log looks confusing):

  1. stop node
  2. drop node from monitor
  3. pg_autoctl run or "drop node locally again" (drop node --pgdata node2)

10:34:52 29061 INFO  Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0.

@DimCitus
Copy link
Collaborator Author

10:34:52 29061 INFO  Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0.

Indeed, keeper_fsm_step didn't get the memo about this log output being a log_debug nowadays. Fixed, thanks!

Contributor

@JelteF JelteF left a comment

These two problems are still present

  1. wait_standby -> single transition is broken (triggered by dropping primary in two node cluster, when other node is wait_standby)
  2. wait_standby -> dropped transition is broken
    [image]

I think it's fine to not fix issue 1, since it seems very rare. But issue 2 should be fixed, since it should be possible to drop a node that's currently in wait_standby mode.

@DimCitus
Collaborator Author

  2. wait_standby -> dropped transition is broken

Fixed now.

@DimCitus DimCitus requested a review from JelteF June 24, 2021 12:42
@DimCitus DimCitus added the Size: XL (Effort Estimate: eXtra Large) label and removed the Size:L (Effort Estimate: Large) label Jun 24, 2021
@JelteF
Contributor

JelteF commented Jun 24, 2021

Fixed now.

Dropping a node while creating it still gives me errors. Now this:

[image]

@DimCitus
Collaborator Author

Dropping a node while creating it still gives me errors. Now this:

You need to be quite fast to get that one. When trying to reproduce it, I was apparently even faster and could drop the node in the step before that. Now fixed.

DimCitus added 4 commits June 24, 2021 17:11
This allows running `pg_autoctl drop node --name foo` from the monitor, or
even using the SQL API directly, and having the node realise it has been
dropped and then stop.

The configuration file, state file, and PGDATA are not touched. To remove
them, use the command pg_autoctl drop node --pgdata ... --destroy on the node
itself.

TODO: review pg_autoctl drop node --destroy, probably broken?
  - fix pg_autoctl drop node for a local Postgres (keeper) node
  - fix removing a node in a cluster of two nodes
DimCitus added 18 commits June 24, 2021 17:11
When a node won't come back up again, the first call to pg_autoctl drop node
sets the assigned role to DROPPED, but the node is not in a position to call
node_active() to clean this entry.

Now, another call to pg_autoctl drop node from the monitor allows cleaning
up the node for real.

If the node does come back up again, the situation is properly detected
and the command pg_autoctl create node --run is now able to re-register the
node with its old nodeid and continue from there.
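
The two-step clean-up in this commit message can be sketched roughly as follows. This is a simplified Python model, not the monitor's actual SQL implementation: the `Monitor` and `Node` classes, the method names other than `node_active`, and the return strings are hypothetical, and re-registration with the old nodeid is not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

DROPPED = "dropped"  # assumed label for the DROPPED assigned role


@dataclass
class Node:
    node_id: int
    name: str
    assigned_state: str = "secondary"


@dataclass
class Monitor:
    nodes: Dict[int, Node] = field(default_factory=dict)

    def register(self, node_id: int, name: str) -> None:
        self.nodes[node_id] = Node(node_id, name)

    def drop_node(self, name: str) -> str:
        """First call assigns DROPPED; a repeated call removes the entry,
        which covers a node that never comes back to report its state."""
        node = next((n for n in self.nodes.values() if n.name == name), None)
        if node is None:
            return "unknown node"
        if node.assigned_state == DROPPED:
            del self.nodes[node.node_id]
            return "removed"
        node.assigned_state = DROPPED
        return "assigned dropped"

    def node_active(self, node_id: int, reported_state: str) -> Optional[str]:
        """Called by a live node; returns its assigned state. Once the node
        reports DROPPED itself, its entry is cleaned up."""
        node = self.nodes.get(node_id)
        if node is None:
            return None
        if node.assigned_state == DROPPED and reported_state == DROPPED:
            del self.nodes[node_id]
        return node.assigned_state
```

In this model a live node cleans up its own entry through `node_active`, while a node that stays down needs the second `drop_node` call from the monitor side.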
In particular, dropping a local node that is running in the background was
broken. Refactor and simplify the code to make all cases work:

  1. stop local node and drop
  2. drop local node while it's running in the background
  3. drop node from the monitor while it's running
  4. drop node from the monitor after having stopped it

In case 4, we then need to resort to pg_autoctl drop node --force, as expected.
@DimCitus DimCitus force-pushed the feature/drop-state branch from 85c6eb7 to f9c63be Compare June 24, 2021 15:11
@DimCitus DimCitus requested a review from JelteF June 24, 2021 15:13
@JelteF
Contributor

JelteF commented Jun 24, 2021

If you're even faster you get this:

[image]

To reproduce:

  1. Press Ctrl+C in the standby right when it is shown in the status window
  2. Drop on the monitor
  3. pg_autoctl run the standby again

Contributor

@JelteF JelteF left a comment

Feel free to fix the last very quick drop issue I provided, but I don't think it's super necessary.

Contributor

@JelteF JelteF left a comment

Found another bug that I think should be fixed:
[image]

To reproduce:

  1. 2 node cluster, wait until stable
  2. Drop node2
  3. Drop node1
  4. recreate node1

@DimCitus
Collaborator Author

  1. 2 node cluster, wait until stable
  2. Drop node2
  3. Drop node1
  4. recreate node1

Fixed now.

Contributor

@JelteF JelteF left a comment

LGTM :shipit: 🎉

@DimCitus DimCitus merged commit 99ae2b8 into master Jun 24, 2021
@DimCitus DimCitus deleted the feature/drop-state branch June 24, 2021 16:54
Labels

enhancement (New feature or request), Size: XL (Effort Estimate: eXtra Large), user experience

Development

Successfully merging this pull request may close these issues.

Dropped node doesn't realise it is dropped

3 participants