
Backups/Restores are in Waiting Status after Kubernetes scheduler restarted the backup-agent container #1463

Open
AlcipPopa opened this issue Mar 6, 2024 · 1 comment


AlcipPopa commented Mar 6, 2024

Report

A MongoDB backup is stuck in Status: Waiting and the backup-agent container does nothing after the Kubernetes scheduler restarted the backup-agent container during the execution of a restore:

(screenshot from 2024-03-06 15:57)

More about the problem

I expect to see an ongoing backup after requesting one through the PerconaServerMongoDBBackup YAML definition, when no other actions (backups/restores) are in progress.

Steps to reproduce

Start a MongoDB cluster in unsafe mode with only 1 replica (this is useful for development environments) and fill it with some data (say, about 600 MB of gzipped data);
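For reference, a minimal sketch of such a single-replica cluster definition. The field names follow the operator 1.15 CRD conventions as I understand them (`allowUnsafeConfigurations` and the storage details are assumptions; the bucket name is a placeholder), with the cluster and storage names matching those used in the manifests below:

```yaml
# Hypothetical sketch of the single-replica cluster used in this report.
# Field names assumed from the operator 1.15 CRD; adjust to your manifest.
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb-percona-cluster
spec:
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:5.0.20-17
  allowUnsafeConfigurations: true   # needed for a 1-member replica set
  replsets:
    - name: rs0
      size: 1
  backup:
    enabled: true
    storages:
      eu-central-1:
        type: s3
        s3:
          region: eu-central-1
          bucket: my-backup-bucket          # placeholder
          credentialsSecret: my-s3-secret   # placeholder
```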

Do a MongoDB backup and wait for completion (Status = Ready) using the following YAML (this uploads the backup to our AWS S3 bucket):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
    - delete-backup
  name: backup1
spec:
  clusterName: mongodb-percona-cluster
  storageName: eu-central-1
  type: logical

Drop the collections on the MongoDB replica set (just to avoid _id clashes in the next step);

Now request a restore of the above backup with the following YAML (this works as intended; I verified the logs and the data inside the MongoDB replica set):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: mongodb-percona-cluster
  backupName: backup1

Request another backup with the following YAML (keep in mind that at this point the previous restore is still in progress):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
    - delete-backup
  name: backup2
spec:
  clusterName: mongodb-percona-cluster
  storageName: eu-central-1
  type: logical

The backup2 will be put in Status=Waiting;

At this point the Kubernetes scheduler kills the backup-agent container in the MongoDB replica pod because of memory pressure and restarts it;
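One way to provoke such a restart deliberately (my suggestion, not part of the original report) is to set a tight memory limit on the backup-agent sidecar via `spec.backup.resources` in the cluster CR, so the kubelet OOM-kills it under load:

```yaml
# Hedged sketch: spec.backup.resources is assumed to constrain the
# backup-agent container in operator 1.15; values are illustrative.
spec:
  backup:
    enabled: true
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 128Mi   # deliberately tight to trigger an OOM kill
```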

Now if you run kubectl get psmdb-backup, you'll see that backup2 is in Error status, and if you run kubectl get psmdb-restore, you'll see that restore1 is also in Error status (OK, I can accept that);

From this point onwards, no backup/restore is possible through any YAML definition; they all remain stuck in Status=Waiting.

The new backup-agent container logs state that it is waiting for incoming requests:

2024/03/05 16:36:01 [entrypoint] starting `pbm-agent`
2024-03-05T16:36:05.000+0000 I pbm-agent:
Version:   2.3.0
Platform:  linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2024-03-05T16:36:05.000+0000 I starting PITR routine
2024-03-05T16:36:05.000+0000 I node: rs0/mongodb-percona-cluster-rs0-0.mongodb-percona-cluster-rs0.default.svc.cluster.local:27017
2024-03-05T16:36:05.000+0000 I listening for the commands

Versions

  1. Kubernetes version v1.27.9 in an 8-node cluster with 4 GB of RAM each, in Azure Cloud
  2. Operator image percona/percona-server-mongodb-operator:1.15.0
  3. Database image percona/percona-server-mongodb:5.0.20-17

Anything else?

The same bug also applies to cron jobs (so it's not triggered only by on-demand backup/restore requests): they are kept in Waiting status.
The bug does NOT happen when using a replica set with at least 3 members (the default topology).
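For comparison, the default topology under which the bug does not reproduce is simply a 3-member replica set, sketched here with the same assumed CRD fields as above:

```yaml
# Default topology sketch (field names assumed from the operator CRD).
spec:
  replsets:
    - name: rs0
      size: 3   # 3 members: unsafe mode not required
```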

@AlcipPopa AlcipPopa added the bug label Mar 6, 2024
@AlcipPopa AlcipPopa changed the title Backups/Restores are in Waiting Status after Kubernetes scheduler killed the backup-agent container Backups/Restores are in Waiting Status after Kubernetes scheduler restarted the backup-agent container Mar 6, 2024
spron-in (Collaborator) commented Mar 7, 2024

Nice catch @AlcipPopa . @hors I think we had something in our backlog about it. Thoughts?
