-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare for eliminating delay on extend #157
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
nirs
added
enhancement
Enhancing the system by adding new feature or improving performance or reliability
cleanup
Code change keeping current behavior
storage
virt
labels
May 1, 2022
/ost |
nirs
added
the
verified
Change was tested; please describe how it was tested in the PR
label
May 1, 2022
Tested extend flows:
|
aesteve-rh
previously approved these changes
May 2, 2022
vjuranek
reviewed
May 2, 2022
On each monitor cycle, we need to get block stats from libvirt and updated the block info for every drive that needs monitoring. Based on the block info, we decide if the drive should be extended. Separate the block info update from the extend check. This makes the code easier to change and understand. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Previously we returned True if some drives were extended, and ENOSPC handler logged a message if some drives were extended. This log is not very useful. Simplify the interface by not returning anything. This will allow handling the monitoring request in the executor thread pool, and it simplifies the monitoring code. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When a vm pauses because of I/O error, we need to mark the drive for extension. Look up the drive early and pass it to the helper for sending event, instead of looking up the drive in the helper. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When a VM is paused because of ENOSPC, call VolumeMonitor.on_enospc() instead of VolumeMonitor.monitor_volume(). This will make it easier to unify the way we handle block threshold events and I/O errors. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
VM.monitor_volumes() was called only from the periodic job, which already access vm.volume_monitor for checking if monitoring is needed. Remove the unneeded method and change the periodic job and the tests to access vm.volume_monitor.monitor_volumes(). Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We always use block threshold events. When the flag to disable block threshold event was removed, the comment was not updated. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Previously calling the event handler twice would reset the exceeded time and log bogus message. We should reset the exceeded time and log only when setting the block threshold to exceeded on the first time. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Previously we called monitor_volumes(), but this does nothing unless a drive block threshold is UNSET or EXCEEDED. If the guest is writing fast, it may exceed the block threshold before we set it, and this case we will never get a block threshold event from libvirt. This can lead to paused VM with full disk that will never be extended. Since the disk is full, resuming the VM fails immediately with ENOSPC. But since block threshold was already exceeded, qemu does not submit a block threshold event. The only way to recover is to shutdown the VM. Fixed by unifying the way we handle ENOSPC and block threshold event; We mark a drive block threshold as EXCEEDED in both cases. The periodic monitor will pick up the drive in the next monitoring cycle. Since on_enospc() cannot handle the case of ENOSPC without a drive, we call it only when know the drive. Fixing monitor_volumes() to handle this case is possible but require a big change, and I don't think this case actually happens. This change have nice side effects: - Avoid the issue of blocking the libvirt event loop when calling monitor_volumes() and storage APIs block. - Avoid noise in the logs when libvirt issue multiple ENOSPC events for the same drive. We log changes in drive block threshold only once. This change introduces up to 2 seconds delay when resuming a paused VM. However, since the VM already paused, the delay is not critical. This delay will be eliminated when we eliminate the delay when receiving block threshold event. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When finding that a drive need extension we force drive threshold state to exceeded to make sure we continue to monitor this drive until it is extended. This is good, but the way this was implemented is wrong, clearing the exceeded time. Use Drive.on_block_threshold() which does the right thing now. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When libvirt report that the threshold was exceeded: 2022-04-09 23:30:01,179+0300 INFO (libvirt/events) [virt.vm] (vmId='724d41da-0b01-4d58-8a73-113906da1565') Block threshold 3221225472 exceeded by 196608 for drive 'sdb[1]' querying block status may show that allocation is smaller than the threshold: 2022-04-09 23:30:01,186+0300 DEBUG (periodic/1) [virt.vm] (vmId='724d41da-0b01-4d58-8a73-113906da1565') Extension info for drive sdb volume e1b76177-d925-497d-93ff-05ccb980f32a: BlockInfo(index=1, name='sdb', allocation=3213557760, capacity=53687091200, physical=5368709120, threshold=0) This happens when trying to extend a disk immediately after receiving a block threshold event. Change the check to always extend drives when threshold state is EXCEEDED without checking the amount of free space. Since we always extend, we don't need the workaround for libvirt bug when allocation is not reported. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When handling block threshold or enospc events, return True if the event change the drive threshold state to EXCEEDED. This can be used by callers to detect when a drive should be extended immediately. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Move comments from the docstring into the function to keep the comments closer to the code they explain. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Because we always mark a drive for extension, there is no point to log the threshold state. Unify the text "Requesting an extension" to match similar logs about extending the volume and the replica. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
FakeVM block_stats and query_block_stats are not used now. They were left by mistake when tests using them were removed. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
If we detect an improbable allocation value, we paused the VM and raise. Improve this implementation to make it easier to work with. - Rename the exception to describe the issue better - Remove the error log since we raise. The top level error handler will log a traceback with this error. - Don't handle this error in monitor_volumes(), so the error will propagate to the top level error handler. - Use fstring to format the exception instead of fragile %. - Simplify the exception message, all the important details are already included in block_info named tuple. - Add the missing test to make sure this behavior is not modified by mistake. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We use the allocation reported by libvirt only in when checking drives before getting a block threshold event, so we better check the allocation only when we we need it. Getting improbable allocation in other cases does not affect anything. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
vjuranek
approved these changes
May 3, 2022
aesteve-rh
approved these changes
May 3, 2022
bennyz
approved these changes
May 4, 2022
mz-pdm
approved these changes
May 4, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
cleanup
Code change keeping current behavior
enhancement
Enhancing the system by adding new feature or improving performance or reliability
storage
verified
Change was tested; please describe how it was tested in the PR
virt
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is the first part from #124, preparing for eliminating the the delay
before extending drives (#85). The changes were reviewed in the last 3 weeks
and should be ready for merge.
Fixes:
Improvements:
Refactoring: