Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: RemoveImage: handle failure in VDS command #656

Merged
merged 1 commit into from
Sep 22, 2022

Conversation

mkemel
Copy link
Member

@mkemel mkemel commented Sep 15, 2022

RemoveImageCommand used to call for DeleteImageGroup VDS command, and perform the required DB operations without polling on the Async Task.

One issue that it led to is when disk is configured with "Wipe on Delete", and the wipe action took more than 50 hours (default max async task polling time) - the task was interrupted, the image is not deleted in VDSM, but removed from DB, not having any tracking for them in the engine.

In this patch we add async task polling to RemoveImageCommand and handle the removal of the image from the DB only if it succeeded, otherwise we mark the image as ILLEGAL, and add an Audit Log, logging the error and the IDs of the volumes that might need manual removal. Additional removal of the now ILLEGAL disk will succeed, as the command will not find any volumes associated with it in VDSM

Bug-Url: https://bugzilla.redhat.com/1836318

@mkemel
Copy link
Member Author

mkemel commented Sep 15, 2022

@ahadas @bennyz
So async task polling on RemoveImage works well, but I still can't figure out what to do with the failure.
We are left in VDSM with image with deleted volume

image:    81d09b66-3fde-432f-b3ad-f8781ce7de45
             - cc81defd-f99c-49bd-bf10-ad48dfd869ed
               status: REMOVED, voltype: LEAF, format: COW, legality: LEGAL, type: PREALLOCATED, capacity: 5368709120, truesize: 5368709120

And lv with a broken tag

  cc81defd-f99c-49bd-bf10-ad48dfd869ed c611fa6f-5272-4629-a6c5-6f94a37a0789 -wi-------   5.00g
IU__remove_me_ZERO_81d09b66-3fde-432f-b3ad-f8781ce7de45,MD_6,PU_00000000-0000-0000-0000-000000000000

So we can do getVolumeInfo on all the volumes of the disk (there's not getImageInfo for image group as I see), and leave the disk there in the DB, maybe allowing to run the remove again?
But @ahadas 's solution with removing from DB anyway and leaving an appropriate Audit Log seems good to me as well. What do you think?

@mkemel mkemel force-pushed the db_sync_for_remove_image branch 2 times, most recently from 7c3c2d8 to cc1ef95 Compare September 20, 2022 15:14
@mkemel
Copy link
Member Author

mkemel commented Sep 21, 2022

/ost

Copy link
Member

@barpavel barpavel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just minor comments.

Copy link
Member

@barpavel barpavel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@barpavel
Copy link
Member

Just missed why after all RemoveImageCommand::setSucceeded() is needed if it only calls it's super.setSucceeded()...

@ahadas
Copy link
Member

ahadas commented Sep 22, 2022

Just missed why after all RemoveImageCommand::setSucceeded() is needed if it only calls it's super.setSucceeded()...

+1
@mkemel can you explain that part?

RemoveImageCommand used to call for DeleteImageGroup VDS command, and
perform the required DB operations without polling on the Async Task.

One issue that it led to is when disk is configured with "Wipe on
Delete", and the wipe action took more than 50 hours (default max
async task polling time) - the task was interrupted, the image is not
deleted in VDSM, but removed from DB, not having any tracking for them
in the engine.

In this patch we add async task polling to RemoveImageCommand and
handle the removal of the image from the DB only if it succeeded,
otherwise we mark the image as ILLEGAL, and add an Audit Log, logging
the error and the IDs of the volumes that might need manual removal.
Additional removal of the now ILLEGAL disk will succeed, as the command
will not find any volumes associated with it in VDSM

Bug-Url: https://bugzilla.redhat.com/1836318
@mkemel
Copy link
Member Author

mkemel commented Sep 22, 2022

@barpavel @ahadas
CommandBase::setSucceeded is protected, and cannot be used from RemoveImageCommandCallback
There's a compilation error on this one.
'setSucceeded(boolean)' has protected access in 'org.ovirt.engine.core.bll.CommandBase'

@barpavel
Copy link
Member

@barpavel @ahadas CommandBase::setSucceeded is protected, and cannot be used from RemoveImageCommandCallback There's a compilation error on this one. 'setSucceeded(boolean)' has protected access in 'org.ovirt.engine.core.bll.CommandBase'

Bummer, RemoveImageCommandCallback & CommandBase in different packages :)
Similar problem sometimes arises when writing tests.

@ahadas
Copy link
Member

ahadas commented Sep 22, 2022

/ost

@ahadas
Copy link
Member

ahadas commented Sep 22, 2022

unrelated failure in OST (UI tests)

@ahadas ahadas merged commit 0b92d02 into oVirt:master Sep 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants