
MedusaBackup may have status.finishTime set to "1970-01-01T00:00:00Z" while backup is in progress #1301

Open
c3-clement opened this issue Apr 29, 2024 · 8 comments
Labels: assess (Issues in the state 'assess'), bug (Something isn't working)

c3-clement (Contributor) commented Apr 29, 2024

What happened?

A MedusaBackup CR may have the field status.finishTime set to "1970-01-01T00:00:00Z" while the backup is still in progress.

[screenshot: MedusaBackup status showing finishTime "1970-01-01T00:00:00Z" while the backup is IN_PROGRESS]

According to https://docs.k8ssandra.io/tasks/backup-restore/#checking-backup-completion, a backup is considered completed when status.finishTime is set.

Did you expect to see something different?

I expected the field status.finishTime not to be set while the backup is in progress, as stated in the documentation.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy k8ssandra cluster
  2. Create a MedusaBackupJob
  3. Create a MedusaTask in sync mode while the backup job is in progress

If the backup sync completes before the backup job completes, the issue will happen.

Environment

  • K8ssandra Operator version:
    1.14

  • Kubernetes version information:
    1.27 and 1.29

  • Kubernetes cluster kind:
    EKS, AKS and GKE

  • K8ssandra Operator Logs:

insert K8ssandra Operator logs relevant to the issue here

Anything else we need to know?:

I'm fairly sure the issue is caused by this line:
https://github.com/k8ssandra/k8ssandra-operator/blob/main/controllers/medusa/medusatask_controller.go#L328

It sets finishTime from a zero Unix timestamp, which converts to 1970-01-01T00:00:00Z. However, the zero value of metav1.Time is not 1970-01-01 but 0001-01-01 (January 1, year 1, 00:00:00 UTC), so a 1970 timestamp reads as a real finish time.
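
To make the distinction concrete, here is a minimal standalone Go sketch (not the operator's actual code) showing how the two "zero" timestamps differ and why the 1970 value slips past a naive "is the finish time set?" check:

package main

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // Converting a zero Unix timestamp (what an unfinished backup reports)
    // produces 1970-01-01T00:00:00Z, which metav1 does NOT consider unset.
    epoch := metav1.Unix(0, 0)
    fmt.Println(epoch.UTC().Format(time.RFC3339), epoch.IsZero()) // 1970-01-01T00:00:00Z false

    // The actual zero value of metav1.Time is 0001-01-01 (year 1); it reports
    // IsZero() == true and marshals to JSON as null, so the field looks unset.
    var unset metav1.Time
    fmt.Println(unset.Format(time.RFC3339), unset.IsZero()) // 0001-01-01T00:00:00Z true
}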

  • Manifests:
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaBackup
  metadata:
    creationTimestamp: "2024-04-26T17:00:04Z"
    generation: 1
    name: k8scass-cs-001-k8scass-001-k9sbq
    namespace: eksdoucy
    resourceVersion: "5686257"
    uid: b793f159-5cf0-48b9-9dd5-23f63e49e93f
  spec:
    backupType: differential
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "1970-01-01T00:00:00Z"
    finishedNodes: 1
    nodes:
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-1
      rack: default
      tokens:
      - -2589360646730023400
      - -3653701636503086600
      - -4912122227646739000
      - -7033817127648975000
      - -769356345750076800
      - -8091193644577933000
      - 1107707983762909200
      - 2276662906941923600
      - 3325786707271115000
      - 4261699880062630000
      - 4677337539634270000
      - 5889017886306251000
      - 6286340738654964000
      - 7544761454600407000
      - 8579062101275812000
      - 9005348602556940000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-0
      rack: default
      tokens:
      - -1364658714710476000
      - -2182330647322826800
      - -288902833484896500
      - -3200182052065254000
      - -4271682668405287000
      - -5711950898307870000
      - -6556327545799032000
      - -7647624656491244000
      - -8743666556726182000
      - 1526385666673598700
      - 2769905672354191000
      - 3737269915432726000
      - 447012859225997100
      - 5327274467946625000
      - 6946297763643668000
      - 7924443216159782000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-2
      rack: default
      tokens:
      - -1689492637652962000
      - -2827619177042857000
      - -3889971576713209000
      - -4552103083015450000
      - -5216611785192708000
      - -6043458241930822000
      - -7285896556610953000
      - -8343453241301002000
      - -9010383990221943000
      - 142423091302120580
      - 1973957006571479000
      - 5045680731261046000
      - 6678161695265310000
      - 7311145518738443000
      - 8289036375131932000
      - 856718555699781500
    startTime: "2024-04-26T16:59:13Z"
    status: IN_PROGRESS
    totalNodes: 3
kind: List
metadata:
  resourceVersion: ""
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaRestoreJob
  metadata:
    creationTimestamp: "2024-04-26T17:10:41Z"
    generateName: k8scass-cs-001-k8scass-001-k9sbq-
    generation: 1
    labels:
      app: c3aiops
      ops.c3.ai/parent-resource: eksdoucy-c3cassandrarestore-k8scass-cs-001-6pr92
      role: c3aiops-C3CassandraRestore
    name: k8scass-cs-001-k8scass-001-k9sbq-rh4hx
    namespace: eksdoucy
    ownerReferences:
    - apiVersion: cassandra.datastax.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CassandraDatacenter
      name: k8scass-001
      uid: 2ad0341f-cb57-425e-9a3f-533a31a83152
    resourceVersion: "5691659"
    uid: 596f85ce-01da-464a-a666-482f330090ae
  spec:
    backup: k8scass-cs-001-k8scass-001-k9sbq
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "2024-04-26T17:10:56Z"
    message: target backup has not completed successfully
    restoreKey: 6dca1918-1fdf-47dd-86b1-3f29f0b14c9b
    restoreMapping: {}
    startTime: "2024-04-26T17:10:56Z"
kind: List
metadata:
  resourceVersion: ""
c3-clement added the bug (Something isn't working) label on Apr 29, 2024
c3-clement changed the title from 'MedusaBackup may have status.finishedTime set to "1970-01-01T00:00:00Z" while backup is in progress' to 'MedusaBackup may have status.finishTime set to "1970-01-01T00:00:00Z" while backup is in progress' on Apr 29, 2024
mlallaouret commented

Hello,

I have the same issue on several clusters. On my side, the MedusaBackupJobs are triggered by a MedusaBackupSchedule.
Additionally, subsequent MedusaBackupJobs are no longer triggered.
Could it be related to the status of the last MedusaBackup?

mlallaouret commented

Here are some logs to help debug the issue.

In the k8ssandra-operator logs, the MedusaBackupJob that is still "IN_PROGRESS" (but has its finish time set to 1970-01-01T00:00:00Z) is reported as blocking the next scheduled backup:

2024-05-02T07:46:58.393Z	DEBUG	Postponing backup schedule due to existing active backups	{"controller": "medusabackupschedule", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaBackupSchedule", "MedusaBackupSchedule": {"name":"medusa-backup-schedule-dc1","namespace":"cassandra"}, "namespace": "cassandra", "name": "medusa-backup-schedule-dc1", "reconcileID": "7e23a08a-864f-40dd-bb88-47195370663e", "medusabackupscheduler": "cassandra/medusa-backup-schedule-dc1", "MedusaBackupSchedule": {"namespace": "cassandra", "name": "medusa-backup-schedule-dc1"}}

In my case only one node did not finish the backup; here are the logs of its medusa container, which keep repeating:

[2024-05-02 07:58:46,167] INFO: Using service file: /etc/medusa-secrets/credentials
[2024-05-02 07:58:46,167] DEBUG: Connecting to Google Storage
[2024-05-02 07:58:46,168] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql was not found in cache.
[2024-05-02 07:58:46,168] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql
[2024-05-02 07:58:46,325] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql last modification time is 2024-04-06 02:00:16.517000
[2024-05-02 07:58:46,326] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,326] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,348] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,348] DEBUG: Registered backup name medusa-backup-schedule-dc1-1712368800 found existing, replacing with new
[2024-05-02 07:58:46,348] DEBUG: Cancelling backup` id: medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] INFO: Registered backup id medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,349] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,372] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,373] DEBUG: Updated from existing status: -1 to new status: 0 for backup id: medusa-backup-schedule-dc1-1712368800 
[2024-05-02 07:58:46,373] DEBUG: Disconnecting from Google Storage

adejanovski (Contributor) commented

@rzvoncek, did we change something recently that would set the finishTime where it wasn't the case before?

adejanovski added the assess (Issues in the state 'assess') label on May 2, 2024
mlallaouret commented

For information, deleting the MedusaBackupJob stuck in IN_PROGRESS status triggers a new MedusaBackupJob from the MedusaBackupSchedule.

rzvoncek (Contributor) commented May 7, 2024

Hi. I tried to reproduce this, but I could not.

On a fresh deployment, I always started the sync task after creating the MedusaBackupJob, but sometimes before the job actually started (i.e. got its start timestamp) and sometimes after. I did this on both the current main and the 1.14 k8ssandra-operator.
What I saw, in both cases, was that running the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show the file count and size.

It might be that I'd need a MedusaBackupJob stuck in IN_PROGRESS, which currently happens a lot because of a bug we fixed here. I think I'll go ahead and try to reproduce this as well.

c3-clement (Contributor, Author) commented May 7, 2024

Hi. I tried to reproduce this, but I could not.

On a fresh deployment, I tried starting the sync task always after creating the MedusaBackupJob, but sometimes before that started (got the start timestamp) and sometimes after. I did this on the current main, and 1.14 k8ssandra operator. What I saw, in both cases, was that doing the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show file count and size.

What is the number of replicas in your Cassandra datacenter? I observed that with 1 replica the issue is hard to reproduce, but with 3 replicas it happens consistently (since the backup job takes longer).

Pylypp commented May 27, 2024

I've got the same issue on a 6-node cluster. I'm trying to fix it by cleaning up all Medusa tasks, backup jobs and schedules, snapshots, etc., with no luck so far.

adejanovski assigned himself and unassigned rzvoncek on May 28, 2024
adejanovski (Contributor) commented

Hi folks, the only place where this could happen in the code is here.
If we have a 0 value for the finishTime, it's going to convert to 1970-01-01T00:00:00Z.

Nonetheless, this means the sync is running concurrently with a backup that isn't finished yet, and creates a MedusaBackup object for it anyway.

Most probably, the sync should only persist completed backups locally, since there's nothing to do with incomplete ones anyway.

I'll prepare a PR so that we don't sync the MedusaBackup object when there's no finish time.
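
For illustration, a minimal sketch of what such a guard could look like (the type and field names below are made up for this example; this is not the actual PR):

package main

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// backupSummary stands in for the per-backup data the sync task reads from
// the storage backend (hypothetical type for this sketch).
type backupSummary struct {
    Name       string
    FinishTime int64 // Unix seconds; 0 while the backup is still running
}

// syncableBackups drops backups that have no finish time yet, so the sync
// never creates a MedusaBackup that appears finished at 1970-01-01.
func syncableBackups(remote []backupSummary) []backupSummary {
    var completed []backupSummary
    for _, b := range remote {
        if b.FinishTime == 0 {
            continue // still in progress on the storage backend; skip it
        }
        completed = append(completed, b)
    }
    return completed
}

func main() {
    backups := []backupSummary{
        {Name: "backup-1", FinishTime: 1714147153},
        {Name: "backup-2", FinishTime: 0}, // in progress, must not be synced
    }
    for _, b := range syncableBackups(backups) {
        fmt.Println(b.Name, metav1.Unix(b.FinishTime, 0).UTC().Format(time.RFC3339))
    }
}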

Projects: Status: Assess/Investigate
Development: No branches or pull requests
5 participants