
MedusaBackup may have status.finishTime set to "1970-01-01T00:00:00Z" while backup is in progress #1301

Open
c3-clement opened this issue Apr 29, 2024 · 8 comments
Labels: assess (Issues in the state 'assess'), bug (Something isn't working)

c3-clement (Contributor) commented Apr 29, 2024

What happened?

A MedusaBackup CR may have the field status.finishTime set to "1970-01-01T00:00:00Z" while the backup is still in progress.

[screenshot: MedusaBackup status showing finishTime "1970-01-01T00:00:00Z" while the backup is IN_PROGRESS]

According to https://docs.k8ssandra.io/tasks/backup-restore/#checking-backup-completion, a backup is considered completed when status.finishTime is set.

Did you expect to see something different?

I expected the field status.finishTime not to be set while the backup is in progress, as stated in the documentation.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy k8ssandra cluster
  2. Create a MedusaBackupJob
  3. Create a MedusaTask in sync mode while the backup job is in progress

If the backup sync completes before the backup job completes, the issue will happen.

Environment

  • K8ssandra Operator version:
    1.14

  • Kubernetes version information:
    1.27 and 1.29

  • Kubernetes cluster kind:
    EKS, AKS and GKE

  • K8ssandra Operator Logs:

insert K8ssandra Operator logs relevant to the issue here

Anything else we need to know?:

I'm fairly sure the issue is caused by this line:
https://github.com/k8ssandra/k8ssandra-operator/blob/main/controllers/medusa/medusatask_controller.go#L328

It sets finishTime from a zero Unix timestamp, which converts to 1970-01-01T00:00:00Z. However, the zero value of metav1.Time is not 1970-01-01 but 0001-01-01 (January 1, year 1, 00:00:00 UTC), so a 1970 timestamp reads as a real finish time.
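
To make the distinction concrete, here is a minimal standalone Go sketch (not the operator's actual code) showing how the two "zero" timestamps differ and why the 1970 value slips past a naive "is the finish time set?" check:

package main

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // Converting a zero Unix timestamp (what an unfinished backup reports)
    // produces 1970-01-01T00:00:00Z, which metav1 does NOT consider unset.
    epoch := metav1.Unix(0, 0)
    fmt.Println(epoch.UTC().Format(time.RFC3339), epoch.IsZero()) // 1970-01-01T00:00:00Z false

    // The actual zero value of metav1.Time is 0001-01-01 (year 1); it reports
    // IsZero() == true and marshals to JSON as null, so the field looks unset.
    var unset metav1.Time
    fmt.Println(unset.Format(time.RFC3339), unset.IsZero()) // 0001-01-01T00:00:00Z true
}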

  • Manifests:
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaBackup
  metadata:
    creationTimestamp: "2024-04-26T17:00:04Z"
    generation: 1
    name: k8scass-cs-001-k8scass-001-k9sbq
    namespace: eksdoucy
    resourceVersion: "5686257"
    uid: b793f159-5cf0-48b9-9dd5-23f63e49e93f
  spec:
    backupType: differential
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "1970-01-01T00:00:00Z"
    finishedNodes: 1
    nodes:
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-1
      rack: default
      tokens:
      - -2589360646730023400
      - -3653701636503086600
      - -4912122227646739000
      - -7033817127648975000
      - -769356345750076800
      - -8091193644577933000
      - 1107707983762909200
      - 2276662906941923600
      - 3325786707271115000
      - 4261699880062630000
      - 4677337539634270000
      - 5889017886306251000
      - 6286340738654964000
      - 7544761454600407000
      - 8579062101275812000
      - 9005348602556940000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-0
      rack: default
      tokens:
      - -1364658714710476000
      - -2182330647322826800
      - -288902833484896500
      - -3200182052065254000
      - -4271682668405287000
      - -5711950898307870000
      - -6556327545799032000
      - -7647624656491244000
      - -8743666556726182000
      - 1526385666673598700
      - 2769905672354191000
      - 3737269915432726000
      - 447012859225997100
      - 5327274467946625000
      - 6946297763643668000
      - 7924443216159782000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-2
      rack: default
      tokens:
      - -1689492637652962000
      - -2827619177042857000
      - -3889971576713209000
      - -4552103083015450000
      - -5216611785192708000
      - -6043458241930822000
      - -7285896556610953000
      - -8343453241301002000
      - -9010383990221943000
      - 142423091302120580
      - 1973957006571479000
      - 5045680731261046000
      - 6678161695265310000
      - 7311145518738443000
      - 8289036375131932000
      - 856718555699781500
    startTime: "2024-04-26T16:59:13Z"
    status: IN_PROGRESS
    totalNodes: 3
kind: List
metadata:
  resourceVersion: ""
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaRestoreJob
  metadata:
    creationTimestamp: "2024-04-26T17:10:41Z"
    generateName: k8scass-cs-001-k8scass-001-k9sbq-
    generation: 1
    labels:
      app: c3aiops
      ops.c3.ai/parent-resource: eksdoucy-c3cassandrarestore-k8scass-cs-001-6pr92
      role: c3aiops-C3CassandraRestore
    name: k8scass-cs-001-k8scass-001-k9sbq-rh4hx
    namespace: eksdoucy
    ownerReferences:
    - apiVersion: cassandra.datastax.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CassandraDatacenter
      name: k8scass-001
      uid: 2ad0341f-cb57-425e-9a3f-533a31a83152
    resourceVersion: "5691659"
    uid: 596f85ce-01da-464a-a666-482f330090ae
  spec:
    backup: k8scass-cs-001-k8scass-001-k9sbq
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "2024-04-26T17:10:56Z"
    message: target backup has not completed successfully
    restoreKey: 6dca1918-1fdf-47dd-86b1-3f29f0b14c9b
    restoreMapping: {}
    startTime: "2024-04-26T17:10:56Z"
kind: List
metadata:
  resourceVersion: ""
c3-clement added the bug (Something isn't working) label on Apr 29, 2024
c3-clement changed the title from 'MedusaBackup may have status.finishedTime set to "1970-01-01T00:00:00Z" while backup is in progress' to 'MedusaBackup may have status.finishTime set to "1970-01-01T00:00:00Z" while backup is in progress' on Apr 29, 2024
mlallaouret commented

Hello,

I have the same issue on several clusters. On my side, the MedusaBackupJobs are triggered by a MedusaBackupSchedule.
Additionally, subsequent MedusaBackupJobs are no longer triggered.
Could it be related to the status of the last MedusaBackup?

mlallaouret commented

Here are some logs to help debug the issue.

In the k8ssandra-operator logs, the MedusaBackupJob that is still "IN_PROGRESS" (but has its finish time set to 1970-01-01T00:00:00Z) is reported as blocking the next scheduled backup:

2024-05-02T07:46:58.393Z	DEBUG	Postponing backup schedule due to existing active backups	{"controller": "medusabackupschedule", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaBackupSchedule", "MedusaBackupSchedule": {"name":"medusa-backup-schedule-dc1","namespace":"cassandra"}, "namespace": "cassandra", "name": "medusa-backup-schedule-dc1", "reconcileID": "7e23a08a-864f-40dd-bb88-47195370663e", "medusabackupscheduler": "cassandra/medusa-backup-schedule-dc1", "MedusaBackupSchedule": {"namespace": "cassandra", "name": "medusa-backup-schedule-dc1"}}

In my case only one node did not finish the backup; here are the logs of its medusa container, which keep repeating:

[2024-05-02 07:58:46,167] INFO: Using service file: /etc/medusa-secrets/credentials
[2024-05-02 07:58:46,167] DEBUG: Connecting to Google Storage
[2024-05-02 07:58:46,168] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql was not found in cache.
[2024-05-02 07:58:46,168] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql
[2024-05-02 07:58:46,325] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql last modification time is 2024-04-06 02:00:16.517000
[2024-05-02 07:58:46,326] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,326] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,348] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,348] DEBUG: Registered backup name medusa-backup-schedule-dc1-1712368800 found existing, replacing with new
[2024-05-02 07:58:46,348] DEBUG: Cancelling backup` id: medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] INFO: Registered backup id medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,349] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,372] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,373] DEBUG: Updated from existing status: -1 to new status: 0 for backup id: medusa-backup-schedule-dc1-1712368800 
[2024-05-02 07:58:46,373] DEBUG: Disconnecting from Google Storage

adejanovski (Contributor) commented

@rzvoncek, did we change something recently that would set the finishTime where it wasn't the case before?

adejanovski added the assess (Issues in the state 'assess') label on May 2, 2024
mlallaouret commented

For information, deleting the MedusaBackupJob stuck in IN_PROGRESS status triggers a new MedusaBackupJob from the MedusaBackupSchedule.

rzvoncek (Contributor) commented May 7, 2024

Hi. I tried to reproduce this, but I could not.

On a fresh deployment, I always started the sync task after creating the MedusaBackupJob, but sometimes before the job actually started (i.e. got its start timestamp) and sometimes after. I did this on both the current main and the 1.14 k8ssandra-operator.
What I saw, in both cases, was that running the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show the file count and size.

It might be that I'd need a MedusaBackupJob stuck in IN_PROGRESS, which currently happens a lot because of a bug we fixed here. I think I'll go ahead and try to reproduce this as well.

c3-clement (Contributor, Author) commented May 7, 2024

Hi. I tried to reproduce this, but I could not.

On a fresh deployment, I tried starting the sync task always after creating the MedusaBackupJob, but sometimes before that started (got the start timestamp) and sometimes after. I did this on the current main, and 1.14 k8ssandra operator. What I saw, in both cases, was that doing the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show file count and size.

What is the number of replicas in your Cassandra datacenter? I observed that with 1 replica the issue is hard to reproduce, but with 3 replicas it happens consistently (since the backup job takes longer).

Pylypp commented May 27, 2024

I've got the same issue on a 6-node cluster. I'm trying to fix it by cleaning up all Medusa tasks, backup jobs and schedules, snapshots, etc., with no luck so far.

adejanovski assigned himself and unassigned rzvoncek on May 28, 2024
adejanovski (Contributor) commented

Hi folks, the only place where this could happen in the code is here.
If we have a 0 value for the finishTime, it's going to convert to 1970-01-01T00:00:00Z.

Nonetheless, this means the sync is running concurrently with a backup that isn't finished yet, and creates a MedusaBackup object for it anyway.

Most probably, the sync should only persist completed backups locally, since there's nothing to do with incomplete ones anyway.

I'll prepare a PR so that we don't sync the MedusaBackup object when there's no finish time.
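
For illustration, a minimal sketch of what such a guard could look like (the type and field names below are made up for this example; this is not the actual PR):

package main

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// backupSummary stands in for the per-backup data the sync task reads from
// the storage backend (hypothetical type for this sketch).
type backupSummary struct {
    Name       string
    FinishTime int64 // Unix seconds; 0 while the backup is still running
}

// syncableBackups drops backups that have no finish time yet, so the sync
// never creates a MedusaBackup that appears finished at 1970-01-01.
func syncableBackups(remote []backupSummary) []backupSummary {
    var completed []backupSummary
    for _, b := range remote {
        if b.FinishTime == 0 {
            continue // still in progress on the storage backend; skip it
        }
        completed = append(completed, b)
    }
    return completed
}

func main() {
    backups := []backupSummary{
        {Name: "backup-1", FinishTime: 1714147153},
        {Name: "backup-2", FinishTime: 0}, // in progress, must not be synced
    }
    for _, b := range syncableBackups(backups) {
        fmt.Println(b.Name, metav1.Unix(b.FinishTime, 0).UTC().Format(time.RFC3339))
    }
}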

Projects: Status: Assess/Investigate
Development: No branches or pull requests
5 participants