Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug: if the snapshot is no longer in engine CR, don't block the removal process #2074

Merged
merged 2 commits into from Jul 12, 2023

Conversation

PhanLe1010
Copy link
Contributor

removal process

Longhorn-6298

Signed-off-by: Phan Le <phan.le@suse.com>
Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize you were already on a PR for this, so I drafted #2075. It is similar in concept (allow us to proceed to remove the finalizer if the snapshot is not in the engine). However, I was hoping to just reorder things without adding additional checks.

I'll defer to you since you know this section of code best. LGTM, but need to test it.

@PhanLe1010
Copy link
Contributor Author

@ejweber I think we still need to check with the engine process before removing the snapshot CR's finalizer. This PR will do that. What do you think?

@ejweber
Copy link
Contributor

ejweber commented Jul 11, 2023

We discussed my "competing" PR offline and are in agreement that this is the right approach.

We don't have a simple recreate, but I can use the same iterative test I opened the issue with to check whether my cluster continues to accrue snapshots with this fix.

Copy link
Contributor

@james-munson james-munson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After substantial discussion and explanation, this makes sense to me.

@PhanLe1010 PhanLe1010 marked this pull request as ready for review July 11, 2023 21:04
@PhanLe1010 PhanLe1010 requested a review from a team as a code owner July 11, 2023 21:04
@innobead innobead self-requested a review July 12, 2023 05:16
Copy link
Member

@innobead innobead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

nit: is it possible to have an orphaned snapshot always unable to delete because it can't be deleted at a replica side somehow, so the volume keeps in an auto-attaching state (surely, ticket type snapshot should be interruptable, so it should not impact other operations ideally)?

cc @shuo-wu @derekbit

@innobead
Copy link
Member

@mergify backport v1.5.x

@mergify
Copy link

mergify bot commented Jul 12, 2023

backport v1.5.x

✅ Backports have been created

@innobead
Copy link
Member

innobead commented Jul 12, 2023

Also, auto-cleanup-system-generated-snapshot was introduced in 1.4 as well, so we don't need to tackle this issue even thought the implementation is different than 1.5 which is using longhorn VA.

Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested as suggested in #2074 (comment). After 15 iterations my cluster has only 50 snapshots, and each snapshot is only less than five minutes old. This is as expected.

@PhanLe1010
Copy link
Contributor Author

LGTM.

nit: is it possible to have an orphaned snapshot always unable to delete because it can't be deleted at a replica side somehow, so the volume keeps in an auto-attaching state (surely, ticket type snapshot should be interruptable, so it should not impact other operations ideally)?

cc @shuo-wu @derekbit

If the snapshot is stuck in the removed state in the replica, yes, the volume will remain attached due to the snapshot-controller attachment ticket.

If workload starts on different node, it will interrupt the snapshot AD ticket. For other operations that require attachment, they will request the same node as the snapshot AD ticket. So I think it is fine

@PhanLe1010
Copy link
Contributor Author

Also, auto-cleanup-system-generated-snapshot was introduced in 1.4 as well, so we don't need to tackle this issue even thought the implementation is different than 1.5 which is using longhorn VA.

Sorry, could you elaborate more on this?

I think this issue should happen in 1.5.x only because of the new AD mechanism

@PhanLe1010 PhanLe1010 merged commit a5b67e5 into longhorn:master Jul 12, 2023
6 checks passed
@innobead
Copy link
Member

Sounds good.

@innobead
Copy link
Member

Also, auto-cleanup-system-generated-snapshot was introduced in 1.4 as well, so we don't need to tackle this issue even thought the implementation is different than 1.5 which is using longhorn VA.

Sorry, could you elaborate more on this?

I think this issue should happen in 1.5.x only because of the new AD mechanism

It has been clarified. All good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants