Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

br: clean volumes when restore volume failed #5634

Merged
merged 6 commits into from
May 6, 2024

Conversation

WangLe1321
Copy link
Contributor

@WangLe1321 WangLe1321 commented Apr 24, 2024

What problem does this PR solve?

Add cleaning volumes function when restore volume failed. Then we can avoid volume leak

Closes #5638

What is changed and how does it work?

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

manual test steps:

  1. create a cluster
  2. create a volume backup
  3. create a restore cluster
  4. create a volume restore using the volume backup in step 2. in this step, restore volume is successful and the tikv pods are started
  5. edit the restore CR, modify its status from VolumeComplete to Failed to mock restore volume failed scenario
  6. edit the tc CR, remove the annotation tidb.pingcap.com/tikv-volumes-ready to block tikv creation, then delete the tikv statefulset to detach the EBS volumes
  7. wait for the volumes detached and deleted by tidb-operator

image

image

image

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.


@ti-chi-bot ti-chi-bot bot added the size/L label Apr 24, 2024
@WangLe1321
Copy link
Contributor Author

/run-pull-e2e-kind-br

@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2024

Codecov Report

Attention: Patch coverage is 0% with 83 lines in your changes are missing coverage. Please review.

Project coverage is 47.81%. Comparing base (3897095) to head (bf2e796).
Report is 3 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5634       +/-   ##
===========================================
- Coverage   61.46%   47.81%   -13.65%     
===========================================
  Files         235      219       -16     
  Lines       30397    30307       -90     
===========================================
- Hits        18683    14492     -4191     
- Misses       9840    14095     +4255     
+ Partials     1874     1720      -154     
Flag Coverage Δ
e2e 47.81% <0.00%> (?)
unittest ?

@BornChanger
Copy link
Contributor

/retest

@ti-chi-bot ti-chi-bot bot removed the lgtm label May 4, 2024
@BornChanger
Copy link
Contributor

/retest

@BornChanger
Copy link
Contributor

/run-pull-e2e-kind-basic

@ti-chi-bot ti-chi-bot bot added the lgtm label May 6, 2024
Copy link
Contributor

ti-chi-bot bot commented May 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BornChanger, csuzhangxc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [BornChanger,csuzhangxc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

ti-chi-bot bot commented May 6, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-05-04 02:50:36.183869862 +0000 UTC m=+671189.941005436: ☑️ agreed by BornChanger.
  • 2024-05-04 02:51:05.269796707 +0000 UTC m=+671219.026932276: ✖️🔁 reset by ti-chi-bot[bot].
  • 2024-05-06 03:40:45.043388677 +0000 UTC m=+846998.800524249: ☑️ agreed by csuzhangxc.

@ti-chi-bot ti-chi-bot bot merged commit b9b80c7 into pingcap:master May 6, 2024
13 checks passed
@csuzhangxc
Copy link
Member

/cherry-pick release-1.5

@ti-chi-bot
Copy link
Member

@csuzhangxc: new pull request created to branch release-1.5: #5639.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

csuzhangxc pushed a commit that referenced this pull request May 6, 2024
Co-authored-by: WangLe1321 <wangle1321@163.com>
Copy link
Contributor

@nkg- nkg- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks for working on it. Left a couple of comments.

@@ -130,6 +130,19 @@ func IsRestoreVolumeComplete(restore *Restore) bool {
return condition != nil && condition.Status == corev1.ConditionTrue
}

// IsRestoreVolumeFailed returns true if a Restore for volume is Failed
func IsRestoreVolumeFailed(restore *Restore) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand this what exactly does IsRestoreVolumeFailed entail.

We should only delete volumes, if restore fails in the first step (ie RestoreVolume which creates the volumes), and before creating PV/PVCs which attach volumes to ec2 nodes. Its the volumes which are never attached to an ec2 node (via PVCs) which cause a leak. After PVCs are created, they are tracked and deleted by csi driver.

Also If they restore fails during warmup, or restore_data, its actually useful to investigate the tikvs on what happened. If the volumes are deleted, then it would be hard to investigate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create pv/pvc and warm up volumes and restore data only if the restore CR has the status RestoreVolumeComplete. And if restore volume failed, the status of restore CR is Failed. So this condition can ensure the restore is failed at restore volume step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. So !IsRestoreVolumeComplete(restore) is the condition which ensure restorevolume step has failed.

@@ -204,6 +205,68 @@ func (e *EC2Session) DeleteSnapshots(snapIDMap map[string]string, deleteRatio fl
return nil
}

func (e *EC2Session) DeleteVolumes(volumeIDs []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only delete volumes, which are unattached. How is the validated here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the volume is attached, the delete request will fail. It means, we can't delete an attached volume. In my test, if the tikv pod is running, the volume can't be deleted. This is the reason why I deleted the tikv statefulset in my test steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Basically we rely in aws api contract to prevent attached volumes from getting deleted. Deletes the specified EBS volume. The volume must be in the available state (not attached to an instance).

https://docs.aws.amazon.com/cli/latest/reference/ec2/delete-volume.html

Copy link
Contributor

@nkg- nkg- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the responses.

@@ -130,6 +130,19 @@ func IsRestoreVolumeComplete(restore *Restore) bool {
return condition != nil && condition.Status == corev1.ConditionTrue
}

// IsRestoreVolumeFailed returns true if a Restore for volume is Failed
func IsRestoreVolumeFailed(restore *Restore) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. So !IsRestoreVolumeComplete(restore) is the condition which ensure restorevolume step has failed.

@@ -204,6 +205,68 @@ func (e *EC2Session) DeleteSnapshots(snapIDMap map[string]string, deleteRatio fl
return nil
}

func (e *EC2Session) DeleteVolumes(volumeIDs []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Basically we rely in aws api contract to prevent attached volumes from getting deleted. Deletes the specified EBS volume. The volume must be in the available state (not attached to an instance).

https://docs.aws.amazon.com/cli/latest/reference/ec2/delete-volume.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Volume leaked if volume restore failed in restore volume step
6 participants