status missed even if backup job succeeds (#3648) #3667

ti-srebot · 2020-12-30T13:30:05Z

cherry-pick #3648 to release-1.1

What problem does this PR solve?

Fix the issue that in some cases some fields, e.g. BACKUPPATH, BACKUPSIZE, COMMITTS, etc. in the status of Backup CR or Restore CR are missed even if the backup or restore job succeed.
Fix #3635

What is changed and how does it work?

Root cause:
Take the Backup CR for example, and the Restore CR has a similar logic.
Currently, two controllers will update the status of Backup CR, one is the backup controller in the tidb-controller-manager Pod, which only updates the conditions slice in the status. The other is the controller in the backup-manager Pod which is the real backup job and will update all the fields in the status.
With the current status update logic, if something wrong with the update API call, the status updater will retrieve the latest Backup resource from the lister and replace the status field totally with the old status, so if the backup-manager Pod updates all the fields in the status and then the tidb-controller-manager Pod updates the conditions slice in the status, all the other fields updated by the backup-manager will be lost.
So we should update the status based on the latest status instead of the old status.

Code changes

Has Go code change
Has CI related scripts change

Tests

Unit test
E2E test
Manual test
- Run BackupSchedule with BR
  - The clean of the Backup CRs works as expected
  - The status of the Backup is updated correctly even if an error occurs during the update:

E1230 07:06:10.482472       1 backup_status_updater.go:87] Failed to update backup [ns0/minio-br-2020-12-30t07-06-00], error: Operation cannot be fulfilled on backups.pingcap.com "minio-br-2020-12-30t07-06-00": the object has been modified; please apply your changes to the latest version and try again
I1230 07:06:10.498200       1 backup_status_updater.go:84] Backup: [ns0/minio-br-2020-12-30t07-06-00] updated successfully

# k get bk -n ns0
NAME                                  STATUS     BACKUPPATH                                                     BACKUPSIZE   COMMITTS             AGE
minio-br-2020-12-30t06-26-00          Complete   s3://backup/ns0/br/ns0-pd.ns0-2379-2020-12-30t06-26-00         278 kB       421870849284636673   38m
minio-br-2020-12-30t06-28-00          Complete   s3://backup/ns0/br/ns0-pd.ns0-2379-2020-12-30t06-28-00         278 kB       421870881108918275   36m
minio-br-2020-12-30t06-30-00          Complete   s3://backup/ns0/br/ns0-pd.ns0-2379-2020-12-30t06-30-00         278 kB       421870912146767873   34m
minio-br-2020-12-30t06-32-00          Complete   s3://backup/ns0/br/ns0-pd.ns0-2379-2020-12-30t06-32-00         278 kB       421870943866191875   32m

Run BackupSchedule with Dumpling
- The clean of the Backup CRs works as expected
- The status of the Backup is updated correctly even if an error occurs during the update:

E1230 07:08:09.489023       1 backup_status_updater.go:87] Failed to update backup [ns0/minio-dump-2020-12-30t07-08-00], error: Operation cannot be fulfilled on backups.pingcap.com "minio-dump-2020-12-30t07-08-00": the object has been modified; please apply your changes to the latest version and try again
I1230 07:08:09.890044       1 backup_status_updater.go:84] Backup: [ns0/minio-dump-2020-12-30t07-08-00] updated successfully

# k get bk -n ns0
NAME                                  STATUS     BACKUPPATH                                                     BACKUPSIZE   COMMITTS             AGE
minio-dump-2020-12-30t06-26-00        Complete   s3://backup/ns0/dumpling/backup-2020-12-30T06:26:16Z.tgz       721 B        421870849520566278   38m
minio-dump-2020-12-30t06-28-00        Complete   s3://backup/ns0/dumpling/backup-2020-12-30T06:28:15Z.tgz       721 B        421870880728809476   36m
minio-dump-2020-12-30t06-30-00        Complete   s3://backup/ns0/dumpling/backup-2020-12-30T06:30:15Z.tgz       721 B        421870912382697476   34m

Run Restore with BR and the status of the Restore is updated correctly even if an error occurs during the update:

E1230 06:45:05.777648       1 restore_status_updater.go:79] Failed to update resotre [ns1/br-minio0], error: Operation cannot be fulfilled on restores.pingcap.com "br-minio0": the object has been modified; please apply your changes to the latest version and try again
I1230 06:45:05.794604       1 restore_status_updater.go:76] Restore: [ns1/br-minio0] updated successfully

# k get rt -n ns1
NAME                 STATUS     STARTED                COMPLETED              COMMITTS             AGE
br-minio0            Complete   2020-12-30T06:44:53Z   2020-12-30T06:45:05Z   421870503280771073   28m

Run Restore with Lightning and the status of the Restore is updated correctly even if an error occurs during the update:

E1230 06:41:13.551646       1 restore_status_updater.go:79] Failed to update resotre [ns1/light-minio0], error: Operation cannot be fulfilled on restores.pingcap.com "light-minio0": the object has been modified; please apply your changes to the latest version and try again
I1230 06:41:13.568063       1 restore_status_updater.go:76] Restore: [ns1/light-minio0] updated successfully

# k get rt -n ns1
NAME                 STATUS     STARTED                COMPLETED              COMMITTS             AGE
light-minio0         Complete   2020-12-30T06:41:12Z   2020-12-30T06:41:13Z   421870441493430276   34m

Side effects

Breaking backward compatibility
Other side effects:

Related changes

Need to cherry-pick to the release branch
Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

Fix the issue that `commitTS` or other fields in status may be lost even if the backup job succeeds

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

ti-srebot · 2020-12-30T13:30:14Z

@DanielZhangQD please accept the invitation then you can push to the cherry-pick pull requests.
https://github.com/ti-srebot/tidb-operator/invitations

DanielZhangQD

LGTM

DanielZhangQD · 2020-12-30T13:31:02Z

/merge

ti-chi-bot · 2020-12-30T13:31:04Z

This pull request has been accepted and is ready to merge.

Commit hash: 7a6478e

handlerww · 2020-12-30T13:34:25Z

/merge

ti-chi-bot · 2020-12-30T13:34:27Z

@handlerww: /merge is only allowed for the committers in list.

In response to this:

/merge

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/ti-community-prow repository.

codecov-io · 2020-12-30T13:38:24Z

Codecov Report

❗ No coverage uploaded for pull request base (release-1.1@489e89b). Click here to learn what that means.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             release-1.1    #3667   +/-   ##
==============================================
  Coverage               ?   52.09%           
==============================================
  Files                  ?      153           
  Lines                  ?    14920           
  Branches               ?        0           
==============================================
  Hits                   ?     7773           
  Misses                 ?     6411           
  Partials               ?      736

Flag	Coverage Δ
unittest	`52.09% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

cherry pick pingcap#3648 to release-1.1

7a6478e

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

ti-srebot mentioned this pull request Dec 30, 2020

status missed even if backup job succeeds #3648

Merged

9 tasks

ti-srebot added the type/1.1-cherry-pick label Dec 30, 2020

ti-srebot requested review from csuzhangxc, dragonly, handlerww and LinuxGit December 30, 2020 13:30

ti-srebot added this to the v1.1.10 milestone Dec 30, 2020

ti-srebot assigned DanielZhangQD Dec 30, 2020

ti-chi-bot requested a review from shuijing198799 December 30, 2020 13:30

DanielZhangQD approved these changes Dec 30, 2020

View reviewed changes

ti-chi-bot added the status/LGT1 label Dec 30, 2020

ti-chi-bot added the status/can-merge label Dec 30, 2020

handlerww approved these changes Dec 30, 2020

View reviewed changes

ti-chi-bot merged commit 90ea4e5 into pingcap:release-1.1 Dec 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

status missed even if backup job succeeds (#3648) #3667

status missed even if backup job succeeds (#3648) #3667

ti-srebot commented Dec 30, 2020

ti-srebot commented Dec 30, 2020

DanielZhangQD left a comment

DanielZhangQD commented Dec 30, 2020

ti-chi-bot commented Dec 30, 2020

handlerww commented Dec 30, 2020

ti-chi-bot commented Dec 30, 2020

codecov-io commented Dec 30, 2020

status missed even if backup job succeeds (#3648) #3667

status missed even if backup job succeeds (#3648) #3667

Conversation

ti-srebot commented Dec 30, 2020

What problem does this PR solve?

What is changed and how does it work?

Code changes

Tests

Side effects

Related changes

Release Notes

ti-srebot commented Dec 30, 2020

DanielZhangQD left a comment

Choose a reason for hiding this comment

DanielZhangQD commented Dec 30, 2020

ti-chi-bot commented Dec 30, 2020

handlerww commented Dec 30, 2020

ti-chi-bot commented Dec 30, 2020

codecov-io commented Dec 30, 2020

Codecov Report