
Skip thresholds validation for remove snapshot #614

Merged: 1 commit, Aug 28, 2022


@barpavel (Member) commented Aug 24, 2022

Try to remove the snapshot even when the Storage Domain is low on disk space.
Otherwise, when the snapshot is removed at the end of a Live Storage Migration (LSM) operation, RemoveSnapshotCommand fails validation and leaves the 3-chunk snapshot (7.5 GiB) unreleased.
Additionally, the LSM operation is reported as failed, even though the "move" part actually succeeded.

Because of the 3 extra chunks temporarily used during the LSM flow, we might temporarily fall below the disk space threshold.
Currently, the code not only leaves almost 8 GiB of unreleased "junk", but also leaves the Storage Domain in an unhealthy low-space state. That state blocks other operations and requires manual intervention, instead of the domain quietly recovering from the temporary low-space situation through the proper cleanup.
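To make the validation change concrete, here is a minimal, hypothetical Java sketch. It is not the oVirt engine code: `ThresholdValidationSketch`, `validateFreeSpace`, `ValidationResult`, and the `skipThresholds` flag are illustrative names, and the strict `<` comparison against the 5 GiB Critical Space Action Blocker is an assumption. It only shows why a free-space check that suits space-consuming actions ends up blocking a snapshot removal that would release space.

```java
// Hypothetical sketch only: NOT the oVirt engine API. All names are illustrative.
public class ThresholdValidationSketch {

    // Critical Space Action Blocker value taken from the PR description (5 GiB).
    static final double CRITICAL_SPACE_ACTION_BLOCKER_GIB = 5.0;

    /** Outcome of a hypothetical validation step. */
    record ValidationResult(boolean valid, String reason) {
        static ValidationResult ok() {
            return new ValidationResult(true, null);
        }

        static ValidationResult fail(String reason) {
            return new ValidationResult(false, reason);
        }
    }

    /**
     * Hypothetical free-space check. When skipThresholds is true (the idea behind
     * this PR for snapshot removal), the critical-space check is bypassed, because
     * removing a snapshot releases space instead of consuming it.
     * Assumption: "below the blocker" means strictly less than 5 GiB.
     */
    static ValidationResult validateFreeSpace(double freeGib, boolean skipThresholds) {
        if (skipThresholds) {
            return ValidationResult.ok();
        }
        if (freeGib < CRITICAL_SPACE_ACTION_BLOCKER_GIB) {
            return ValidationResult.fail("ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN");
        }
        return ValidationResult.ok();
    }

    public static void main(String[] args) {
        double freeGib = 4.0; // free space on iSCSI_SD2 reported in the logs

        // Before the fix: the threshold check applies and RemoveSnapshot is rejected.
        System.out.println("before fix: " + validateFreeSpace(freeGib, false));

        // After the fix: the threshold check is skipped for snapshot removal.
        System.out.println("after fix:  " + validateFreeSpace(freeGib, true));
    }
}
```

Skipping only the threshold part keeps the sketch focused on the one condition this PR changes; any other validations a real command performs are outside its scope.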

Before the fix:

2022-08-23 13:14:56,448+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [] EVENT_ID: IRS_DISK_SPACE_LOW_ERROR(201), Critical, Low disk space. iSCSI_SD2 domain has 4 GB of free space.
2022-08-23 13:16:10,455+03 WARN  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [03ef74cf-ddbe-4027-805f-02631bf96929] Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_DISK_SPACE_LOW_ON_STORAGE_DOMAIN,$storageName iSCSI_SD2
2022-08-23 13:16:15,555+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [03ef74cf-ddbe-4027-805f-02631bf96929] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2022-08-23 13:16:15,581+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [03ef74cf-ddbe-4027-805f-02631bf96929] EVENT_ID: USER_MOVE_IMAGE_GROUP_FAILED_TO_DELETE_SRC_IMAGE(2,025), Possible failure while deleting iSCSI_VM1_Disk1 from the source Storage Domain iSCSI_SD2 during the move operation. The Storage Domain may be manually cleaned-up from possible leftovers (User:admin@internal-authz).

At the end, "iSCSI_SD2" has 4 GiB available, which is below the 5 GiB Critical Space Action Blocker.

After the fix, the RemoveSnapshotCommand executes successfully:

2022-08-24 11:00:43,164+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-48) [] EVENT_ID: IRS_DISK_SPACE_LOW_ERROR(201), Critical, Low disk space. iSCSI_SD2 domain has 4 GB of free space.

We temporarily fall below the threshold and then recover automatically: 7.5 GiB is returned to the Storage Domain.
At the end, "iSCSI_SD2" correctly has 12 GiB available, which is above the 5 GiB Critical Space Action Blocker.
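The space accounting above can be checked with a short worked example. This hypothetical Java sketch derives the chunk size from the PR's own numbers (3 chunks, 7.5 GiB, so 2.5 GiB per chunk) and adds the released snapshot back to the 4 GiB of free space. The class and method names are illustrative, not oVirt APIs, and the strict `<` comparison against the blocker is an assumption. The computed 11.5 GiB is consistent with the reported 12 GiB, assuming the displayed values are rounded.

```java
// Hypothetical worked example of the space accounting in this PR; not oVirt code.
public class SpaceRecoverySketch {

    static final double CRITICAL_SPACE_ACTION_BLOCKER_GIB = 5.0; // from the PR text
    static final int TEMPORARY_CHUNKS = 3;                       // from the PR text
    static final double SNAPSHOT_GIB = 7.5;                      // from the PR text

    /** Per-chunk size implied by the PR's numbers: 7.5 GiB / 3 chunks. */
    static double chunkSizeGib() {
        return SNAPSHOT_GIB / TEMPORARY_CHUNKS;
    }

    /** Free space once the temporary snapshot is removed and its chunks are released. */
    static double freeAfterRemoval(double freeBeforeGib) {
        return freeBeforeGib + SNAPSHOT_GIB;
    }

    /** Assumption: the blocker triggers when free space is strictly below 5 GiB. */
    static boolean belowBlocker(double freeGib) {
        return freeGib < CRITICAL_SPACE_ACTION_BLOCKER_GIB;
    }

    public static void main(String[] args) {
        double before = 4.0; // free space on iSCSI_SD2 while the snapshot still exists
        double after = freeAfterRemoval(before);

        System.out.printf("chunk size: %.1f GiB%n", chunkSizeGib());
        System.out.printf("before removal: %.1f GiB (below blocker: %b)%n", before, belowBlocker(before));
        System.out.printf("after removal:  %.1f GiB (below blocker: %b)%n", after, belowBlocker(after));
    }
}
```

The point of the arithmetic is that the low-space state is caused by the very snapshot the validation refuses to remove, so letting the removal proceed is what brings the domain back above the blocker.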

Signed-off-by: Pavel Bar <pbar@redhat.com>

@barpavel (Member Author)

/ost

@barpavel barpavel removed the verified label Aug 24, 2022
@barpavel barpavel marked this pull request as draft August 24, 2022 12:50
@barpavel barpavel force-pushed the no_auto_generated_snapshot_cleanup branch from 670a767 to d0d8f7a Compare August 24, 2022 13:38
@barpavel barpavel changed the title Skip thresholds check for Live Storage Migration remove snapshot Skip thresholds validation for remove snapshot Aug 25, 2022
@barpavel barpavel force-pushed the no_auto_generated_snapshot_cleanup branch from d0d8f7a to 21599db Compare August 25, 2022 11:39
@barpavel barpavel marked this pull request as ready for review August 25, 2022 11:46
@barpavel (Member Author)

/ost

@ahadas ahadas force-pushed the no_auto_generated_snapshot_cleanup branch from 21599db to 98b1b38 Compare August 25, 2022 13:13
@barpavel (Member Author)

/ost

@barpavel (Member Author)

/ost

@ahadas ahadas merged commit a952c95 into oVirt:master Aug 28, 2022
@barpavel barpavel deleted the no_auto_generated_snapshot_cleanup branch August 28, 2022 08:05