-
Notifications
You must be signed in to change notification settings - Fork 129
Implement drop-at-a-distance semantics. #734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issues found:
pg_autoctl drop node --destroyis indeed broken if done without first runningpg_autoctl drop node- In a two node cluster
pg_autoctl drop noderesults in no dropped node and twosinglenodes - We still need a way to remove a node from the monitor, even if the node is completely down. Maybe add a flag like
--force.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(wrong button before)
73600bb to
329c31e
Compare
This is fixed now.
Yes, because Postgres is not running, but it shouldn't prevent us from dropping the node. That's fixed now too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still get this log with empty values when I do (everything still works, but the log looks confusing):
- stop node
- drop node from monitor
pg_autoctl runor "drop node locally again" (drop node --pgdata node2)
10:34:52 29061 INFO Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0.
Indeed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two problems are still present
- wait_standby -> single transition is broken (triggered by dropping primary in two node cluster, when other node is wait_standby)
- wait_standby -> dropped transition is broken

I think it's fine to not fix issue 1, since it seems very rare. But issue 2 should be fixed, since it should be possible to drop a node that's currently in wait_standby mode.
Fixed now. |
You need to be quite fast to get that one. Then when trying I was even faster apparently and could drop the node in the step before that. Now fixed. |
This allows running `pg_autoctl drop node --name foo` from the monitor, or even using the SQL API directly, and have the node realise it's been dropped. And then stop. The configuration file, state file, and PGDATA are not touched, for that the command pg_autoctl drop node --pgdata ... --destroy can be used on the node itself. TODO: review pg_autoctl drop node --destroy, probably broken?
- fix pg_autoctl drop node for a local Postgres (keeper) node - fix removing a node in a cluster of two nodes
When a node won't come back up again, the first call to pg_autoctl drop node sets the assigned role to DROPPED, but the node is not in a position to call node_active() to clean this entry. Now, another call to pg_autoctl drop node from the monitor allows to clean-up the node for real. If the node still comes back up again, the situation is properly detected and the command pg_autoctl create node --run is now able to re-register the node with its old nodeid and continue from there.
In particular dropping a local node that is running in the background was broken. Refactor and simplify the code to make all cases work: 1. stop local node and drop 2. drop local node while it's running in the background 3. drop node from the monitor while it's running 4. drop node from the monitor after having stopped it In case 4. we need to then resort to pg_autoctl drop node --force, as expected.
85c6eb7 to
f9c63be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to fix the last very quick drop issue I provided, but I don't think it's super necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
🎉




This allows running
pg_autoctl drop node --name foofrom the monitor, oreven using the SQL API directly, and have the node realise it's been
dropped. And then stop.
The configuration file, state file, and PGDATA are not touched, for that the
command pg_autoctl drop node --pgdata ... --destroy can be used on the node
itself.
Fixes #690