Postgres replica node restarts with blank data directory #385
Comments
cc: @stormmore
When a previous master becomes a new standby, it may throw an error due to a timeline conflict.
pg_rewind also helps detect whether any WAL is missing on the primary or standby, so the standby can be resumed.
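For anyone new to it, a minimal sketch of rewinding a demoted master so it can rejoin as a standby; the host, port, and user below are assumptions, not values from this issue:

```sh
# Sketch: resync a demoted master against the new primary instead of wiping $PGDATA.
pg_ctl -D "$PGDATA" stop -m fast                      # the old master must be shut down cleanly
pg_rewind --target-pgdata="$PGDATA" \
          --source-server="host=new-primary port=5432 user=postgres dbname=postgres"
# afterwards, add the standby/recovery configuration and start the node as a standby
```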
Why pg_rewind? refs:
Why not just pull all the WAL files?
pg_rewind. Possible error:
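For context, pg_rewind only runs against a cluster that was initialized with data checksums or has wal_log_hints enabled; below is a sketch of preparing for that. This is a general precondition of the tool, not necessarily the error the original reference pointed to:

```sh
# pg_rewind refuses to run if the target cluster has neither data checksums nor wal_log_hints;
# its failure message mentions needing data checksums or "wal_log_hints = on".
# Enabling the setting ahead of time (requires a restart):
psql -U postgres -c "ALTER SYSTEM SET wal_log_hints = on;"
pg_ctl -D "$PGDATA" restart
```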
We're also experiencing this problem and getting the database wiped. We're running with one primary and two replicas.
I can now confirm that using WAL-G archiving does not prevent the data loss. Setup: one primary and two replicas.
Install: a few essential commands to test failover [wip]: controlled failover, i.e. manually changing the leader (maybe, for maintenance) - see the sketch below:
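A rough sketch of such a failover test on Kubernetes, using only generic kubectl/psql commands; the pod names and namespace are hypothetical, not KubeDB specifics:

```sh
# Assume a 3-node cluster with pods pg-0, pg-1, pg-2 in namespace demo (names are made up).
# 1. Find the current primary: pg_is_in_recovery() is false only on the primary.
for p in pg-0 pg-1 pg-2; do
  echo -n "$p: "
  kubectl -n demo exec "$p" -- psql -U postgres -tAc "SELECT pg_is_in_recovery();"
done
# 2. Simulate losing the primary and watch which standby takes over.
kubectl -n demo delete pod pg-0
kubectl -n demo get pods -w
```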
I spent the last two days testing out Zalando's postgres operator (which uses the spilo image, also theirs) and I can't say that I'm super happy. Sure, the failover setup seems to work nicely without nuking my data (which kubedb does), but trying to use cloning from wal-g and such does not work great. The Python code for managing the configuration of spilo is a bit convoluted. I don't think using the spilo image as a base for anything would be a good idea (well, except that it does contain the timescaledb extension by default...).
Is there any chance that looking into the data loss will have priority soon?
btw. @the-redback, basically pg_rewind is the solution. The first versions of the postgres operator from Zalando had a similar problem (I used both projects, and even Zalando Patroni without k8s at all). They also used different solutions for leader election; the newest one is totally different from what kubedb uses (they only change endpoints). And, as already said, the problem can't happen if wal-g is used.
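For reference, "using wal-g" here usually means continuous WAL archiving on the primary plus a restore_command on the standbys, roughly as sketched below; bucket and credential handling are omitted and assumed to be configured elsewhere:

```sh
# Hedged sketch of the usual wal-g wiring; storage configuration is assumed, not shown.
# On the primary: push every completed WAL segment to object storage.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
archive_mode = on
archive_command = 'wal-g wal-push %p'
EOF
# On a standby (recovery.conf on 9.x, postgresql.conf on 12+): pull missing segments back,
# so a replica that fell behind can still catch up instead of failing.
#   restore_command = 'wal-g wal-fetch %f %p'
```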
@schmitch I had the problem happening with wal-g.
@schmitch, I tried using pg_rewind back in version 9.6, maybe. Is it more mature now? Because it didn't work at that time. :(
Well, it obviously works better than losing data.
Patroni:
Pros:
Cons:
This is where spilo starts configuring whether to use wal-e or wal-g. I'm not confident in this way of configuring something without a single test that it is actually accurate: https://github.com/zalando/spilo/blob/09b4cc3c18c0b1d8e666157171794a2a181cdc14/postgres-appliance/scripts/configure_spilo.py#L718
What is the status of the PR https://github.com/kubedb/postgres/pull/248?
Guys, my vrp_vehicle gets deleted. Might it have something to do with mysql-async? I read a post that mysql-async is not supported anymore. Plus I see that I need to alter the table because the maximum length in the database is too long. I think it has to do with mysql or lscustoms...
Please delete this and your account, thanks. |
This issue has been resolved in KubeDB release v2021.03.17. Please open a new bug if you still see this issue.
Currently, when the replica node (pod) starts/restarts, it deletes its data directory, then takes a pg_basebackup of the primary node, then starts as a standby server. ref: https://github.com/kubedb/postgres/blob/b0ed4c6ee2ab737001ef9fc2e3e1b1d48f6bfa24/hack/docker/postgres/9.6.7/scripts/replica/run.sh#L8
Though we can avoid this 'deletion & pg_basebackup' for replica pod restarts, a problem can still arise if the replica node is unavailable for a long time and some WAL segments are deleted in the meantime. After becoming available again, it will lack WAL files, so data loss can happen.
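In other words, the current replica startup behaves roughly like the sketch below (simplified; the variable names are assumptions, see the linked run.sh for the actual script):

```sh
# Simplified sketch of the replica startup behaviour described above (not the literal run.sh).
rm -rf "$PGDATA"/*                                    # wipe whatever data the pod already had
pg_basebackup -D "$PGDATA" -X fetch \
    -h "$PRIMARY_HOST" -U "$POSTGRES_USER" -w         # re-seed the node from the primary
# ...then start the node as a hot standby streaming from the primary
```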
Some references regarding this topic:
Though replication_slot is a good alternative, it needs to be dropped when the corresponding replica node removes itself from the cluster. The problem is, it's hard to determine which pod is terminating for good and which one is just restarting.
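For completeness, creating and dropping a physical slot is simple; the hard part, as noted, is deciding when to drop it. The slot name below is illustrative:

```sh
# On the primary: retain WAL until the named replica has consumed it.
psql -U postgres -c "SELECT pg_create_physical_replication_slot('replica_pod_1');"
# Must be dropped when that replica leaves for good, otherwise the primary
# retains WAL indefinitely and eventually runs out of disk space.
psql -U postgres -c "SELECT pg_drop_replication_slot('replica_pod_1');"
```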