Minimize wait after receiving block threshold event #85
Labels: enhancement, storage, virt
nirs added the enhancement, storage, and virt labels on Feb 23, 2022
nirs added a commit to nirs/vdsm that referenced this issue on Mar 20, 2022:
Minimize message latency by introducing an event mechanism. The unused host 0 mailbox is now used for sending and receiving change events. When a host sends mail to the SPM, it writes an event to the event block. The SPM monitors this block 10 times per monitor interval, so it can quickly detect that a new message is ready. The HSM mail monitor checks its inbox 10 times per monitor interval, so it can detect replies more quickly.

With both changes, the latency of sending an extend request was reduced from 2-4 seconds to 0.2-0.4 seconds, reducing the risk of pausing a VM when writing quickly to fast storage. Testing shows that we can now write 550 MiB/s without pausing a VM when the disk is extended. Before this change, we could write only 350 MiB/s without pausing.

Here are example logs from this run, showing that the total extend time is 1.16-3.14 seconds instead of 2.5-8.3 seconds.

<Clock(total=1.16, wait=0.30, extend-volume=0.47, refresh-volume=0.38)>
<Clock(total=2.43, wait=1.30, extend-volume=0.73, refresh-volume=0.40)>
<Clock(total=1.60, wait=0.81, extend-volume=0.48, refresh-volume=0.30)>
<Clock(total=2.61, wait=1.59, extend-volume=0.69, refresh-volume=0.32)>
<Clock(total=1.80, wait=0.66, extend-volume=0.76, refresh-volume=0.38)>
<Clock(total=3.14, wait=1.89, extend-volume=0.74, refresh-volume=0.51)>
<Clock(total=2.17, wait=1.09, extend-volume=0.71, refresh-volume=0.37)>
<Clock(total=1.35, wait=0.15, extend-volume=0.70, refresh-volume=0.51)>
<Clock(total=2.43, wait=1.32, extend-volume=0.76, refresh-volume=0.35)>
<Clock(total=1.76, wait=0.64, extend-volume=0.75, refresh-volume=0.36)>
<Clock(total=2.74, wait=1.61, extend-volume=0.76, refresh-volume=0.37)>
<Clock(total=2.01, wait=0.72, extend-volume=0.98, refresh-volume=0.31)>
<Clock(total=2.35, wait=1.53, extend-volume=0.53, refresh-volume=0.30)>
<Clock(total=1.88, wait=0.79, extend-volume=0.75, refresh-volume=0.34)>
<Clock(total=1.26, wait=0.10, extend-volume=0.76, refresh-volume=0.40)>
<Clock(total=1.90, wait=0.75, extend-volume=0.78, refresh-volume=0.38)>
<Clock(total=3.06, wait=1.87, extend-volume=0.77, refresh-volume=0.42)>
<Clock(total=1.84, wait=0.68, extend-volume=0.75, refresh-volume=0.41)>
<Clock(total=2.54, wait=1.74, extend-volume=0.51, refresh-volume=0.29)>
<Clock(total=2.25, wait=1.08, extend-volume=0.70, refresh-volume=0.47)>

The largest issue now is the wait time; in the worst case, we waited 1.89 seconds before sending an extend request, which is 60% of the total extend time (3.14 seconds). This issue is tracked in oVirt#85.

Fixes oVirt#102.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/vdsm that referenced this issue on Mar 23, 2022:
Minimize message latency by introducing an event mechanism. The unused host 0 mailbox is now used for sending and receiving change events. When a host sends mail to the SPM, it writes an event to the event block. The SPM monitors this block every eventInterval (0.5 seconds) between monitor intervals, so it can quickly detect that a new message is available. The HSM mail monitor checks its inbox every eventInterval (0.5 seconds) when it is waiting for replies, so it can detect replies quickly.

With both changes, the latency of sending an extend request was reduced from 2.0-4.0 seconds to 0.5-1.0 seconds, reducing the risk of pausing a VM when writing quickly to fast storage. Reducing the event interval increases vdsm CPU usage, since we use dd to read events. To improve this, we need to add a helper process for checking events and reading mailbox data.

Testing shows that we can now write 525 MiB/s (the maximum rate on my nested test environment) in the guest without pausing a VM when the disk is extended. Before this change, we could write only 350 MiB/s before the VM started to pause randomly during the test.

Here are example logs from this run, showing that the total extend time is 1.14-3.31 seconds instead of 2.5-8.3 seconds.

<Clock(total=1.65, wait=0.28, extend-volume=1.09, refresh-volume=0.28)>
<Clock(total=2.67, wait=1.80, extend-volume=0.58, refresh-volume=0.29)>
<Clock(total=3.10, wait=1.74, extend-volume=1.10, refresh-volume=0.25)>
<Clock(total=2.85, wait=1.55, extend-volume=1.08, refresh-volume=0.22)>
<Clock(total=2.02, wait=1.14, extend-volume=0.58, refresh-volume=0.30)>
<Clock(total=1.14, wait=0.33, extend-volume=0.56, refresh-volume=0.25)>
<Clock(total=2.83, wait=1.42, extend-volume=1.09, refresh-volume=0.32)>
<Clock(total=1.68, wait=0.33, extend-volume=1.10, refresh-volume=0.25)>
<Clock(total=2.45, wait=1.47, extend-volume=0.60, refresh-volume=0.38)>
<Clock(total=1.44, wait=0.11, extend-volume=1.09, refresh-volume=0.24)>
<Clock(total=2.46, wait=1.04, extend-volume=1.13, refresh-volume=0.30)>
<Clock(total=1.55, wait=0.17, extend-volume=1.07, refresh-volume=0.31)>
<Clock(total=2.02, wait=1.13, extend-volume=0.60, refresh-volume=0.28)>
<Clock(total=1.75, wait=0.39, extend-volume=1.12, refresh-volume=0.24)>
<Clock(total=2.98, wait=1.61, extend-volume=1.12, refresh-volume=0.25)>
<Clock(total=1.28, wait=0.41, extend-volume=0.61, refresh-volume=0.25)>
<Clock(total=2.46, wait=1.48, extend-volume=0.62, refresh-volume=0.36)>
<Clock(total=3.31, wait=1.97, extend-volume=1.12, refresh-volume=0.21)>
<Clock(total=1.81, wait=0.96, extend-volume=0.58, refresh-volume=0.27)>
<Clock(total=2.74, wait=1.87, extend-volume=0.58, refresh-volume=0.29)>

I also tested a shorter eventInterval (0.2 seconds). This reduces the extend time by 20%, but doubles the CPU usage of the SPM mailbox.

With this change, the next improvement is eliminating the wait time. In the worst case, we waited 1.97 seconds before sending an extend request, which is 68% of the total extend time (3.31 seconds). This issue is tracked in oVirt#85.

Fixes oVirt#102.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit that referenced this issue on Mar 28, 2022:
Minimize message latency by introducing a mailbox events mechanism. The unused host 0 mailbox is now used for sending and receiving mailbox events.

To control mailbox events, a new "mailbox:events_enable" option was added. The option is disabled by default, so we can test this change before we enable it by default, or disable it in production if needed. To enable mailbox events, add a drop-in configuration file to all hosts:

    $ cat /etc/vdsm/vdsm.conf.d/99-local.conf
    [mailbox]
    events_enable = true

and restart the vdsmd service.

When the mailbox:events_enable option is enabled:
- Hosts write an event to the host 0 mailbox after sending mail to the SPM.
- The SPM monitors the host 0 mailbox every eventInterval (0.5 seconds) between monitor intervals, so it can handle new messages quickly.
- When hosts wait for a reply from the SPM, they monitor their inbox every eventInterval (0.5 seconds), so they detect the reply quickly.
- The host reports a new "mailbox_events" capability. This can be used by engine to optimize mailbox I/O when all hosts in a data center support this capability.

With this change, the extend roundtrip latency was reduced from 2.0-4.0 seconds to 0.5-1.0 seconds, reducing the risk of pausing a VM when writing quickly to fast storage.

Testing shows that we can now write 525 MiB/s (the maximum rate on my nested test environment) in the guest without pausing a VM when the disk is extended. Before this change, we could write only 350 MiB/s before the VM started to pause randomly during the test.

Here are example logs from this run, showing that the total extend time is 1.14-3.31 seconds instead of 2.5-8.3 seconds before this change.

<Clock(total=1.65, wait=0.28, extend-volume=1.09, refresh-volume=0.28)>
<Clock(total=2.67, wait=1.80, extend-volume=0.58, refresh-volume=0.29)>
<Clock(total=3.10, wait=1.74, extend-volume=1.10, refresh-volume=0.25)>
<Clock(total=2.85, wait=1.55, extend-volume=1.08, refresh-volume=0.22)>
<Clock(total=2.02, wait=1.14, extend-volume=0.58, refresh-volume=0.30)>
<Clock(total=1.14, wait=0.33, extend-volume=0.56, refresh-volume=0.25)>
<Clock(total=2.83, wait=1.42, extend-volume=1.09, refresh-volume=0.32)>
<Clock(total=1.68, wait=0.33, extend-volume=1.10, refresh-volume=0.25)>
<Clock(total=2.45, wait=1.47, extend-volume=0.60, refresh-volume=0.38)>
<Clock(total=1.44, wait=0.11, extend-volume=1.09, refresh-volume=0.24)>
<Clock(total=2.46, wait=1.04, extend-volume=1.13, refresh-volume=0.30)>
<Clock(total=1.55, wait=0.17, extend-volume=1.07, refresh-volume=0.31)>
<Clock(total=2.02, wait=1.13, extend-volume=0.60, refresh-volume=0.28)>
<Clock(total=1.75, wait=0.39, extend-volume=1.12, refresh-volume=0.24)>
<Clock(total=2.98, wait=1.61, extend-volume=1.12, refresh-volume=0.25)>
<Clock(total=1.28, wait=0.41, extend-volume=0.61, refresh-volume=0.25)>
<Clock(total=2.46, wait=1.48, extend-volume=0.62, refresh-volume=0.36)>
<Clock(total=3.31, wait=1.97, extend-volume=1.12, refresh-volume=0.21)>
<Clock(total=1.81, wait=0.96, extend-volume=0.58, refresh-volume=0.27)>
<Clock(total=2.74, wait=1.87, extend-volume=0.58, refresh-volume=0.29)>

I also tested a shorter eventInterval (0.2 seconds). This reduces the extend time by 20%, but doubles the CPU usage of the SPM mailbox.

With this change, the next improvement is eliminating the wait time. In the worst case, we waited 1.97 seconds before sending an extend request, which is 68% of the total extend time (3.31 seconds). This issue is tracked in #85.

Fixes #102.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
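For readers unfamiliar with the mailbox code, the following is a minimal sketch of the SPM-side polling pattern described above: between full inbox scans, the host 0 event block is checked at a short interval so new requests are noticed quickly. The constants and the read_event_block()/scan_inbox() callables are illustrative assumptions, not the actual vdsm API.

```python
import time

MONITOR_INTERVAL = 2.0   # seconds between full inbox scans (assumed default)
EVENT_INTERVAL = 0.5     # seconds between cheap event-block checks

def spm_mailbox_loop(read_event_block, scan_inbox):
    """Sketch: scan the inbox periodically, but also as soon as the
    host 0 event block changes, so extend requests are handled quickly."""
    last_event = read_event_block()
    while True:
        scan_inbox()  # full scan, as before this change
        deadline = time.monotonic() + MONITOR_INTERVAL
        while time.monotonic() < deadline:
            time.sleep(EVENT_INTERVAL)
            event = read_event_block()  # small read of the host 0 mailbox
            if event != last_event:
                last_event = event
                break  # a host signalled new mail; rescan immediately
```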
nirs added a commit to nirs/vdsm that referenced this issue on Apr 11, 2022:
When receiving a block threshold event or when pausing because of ENOSPC, extend the drive as soon as possible on the periodic executor.

To reuse the periodic monitor infrastructure, VolumeWatermarkMonitor now provides a dispatch() class method. It runs monitor_volumes() with urgent=True on the periodic executor, to ensure that this invocation extends drives immediately, even if the last extend started less than 2.0 seconds ago.

This change decreases the wait before sending an extend request from 0.0-2.0 seconds to 10 milliseconds, and the total time to extend to 0.66-1.30 seconds. With this we can write 50 GiB at a rate of 1320 MiB/s to a thin disk without pausing the VM. The theoretical limit is 1538 MiB/s, but my NVMe drive is not fast enough.

Extend stats with this change:

| time    | min  | avg  | max  |
|---------|------|------|------|
| total   | 0.66 | 0.97 | 1.30 |
| extend  | 0.53 | 0.79 | 1.11 |
| refresh | 0.08 | 0.18 | 0.23 |
| wait    | 0.01 | 0.01 | 0.01 |

Unfinished:
- Some tests fail because the periodic executor is not running during the tests.
- Need to add tests for the new behavior.

Fixes: oVirt#85

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
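To make the urgent=True behaviour concrete, here is a minimal sketch under stated assumptions: the class, attribute, and helper names are simplified stand-ins, not the actual vdsm implementation. A non-urgent (periodic) run is throttled, while an urgent run dispatched from an event always proceeds.

```python
import time

MIN_INTERVAL = 2.0  # assumed minimum seconds between non-urgent extend attempts


class VolumeMonitorSketch:
    def __init__(self):
        self._last_extend = 0.0

    def monitor_volumes(self, urgent=False):
        now = time.monotonic()
        if not urgent and now - self._last_extend < MIN_INTERVAL:
            return  # periodic run: too soon after the last extend attempt
        self._last_extend = now
        self._extend_drives_if_needed()

    def _extend_drives_if_needed(self):
        # Check drive watermarks and send extend requests (omitted here).
        pass
```

A dispatch() helper would then submit monitor_volumes(urgent=True) to the periodic executor when a block threshold or ENOSPC event arrives.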
nirs added a commit that referenced this issue on May 11, 2022:
Expose a periodic.dispatch() function allowing immediate dispatching of calls on the periodic executor. This is useful when you want to handle libvirt events on the periodic executor.

The first user of this facility is the thinp volume monitor. Now, when we receive a block threshold or ENOSPC event, we use the periodic dispatch to extend the relevant drive immediately. This eliminates the 0-2 second wait after receiving an event.

Here are test results from 4 runs, each writing 50 GiB to a thin disk at ~1300 MiB/s. Each run extends the disk 20 times. The VM was not paused during the test.

| time    | min  | avg  | max  |
|---------|------|------|------|
| total   | 0.77 | 1.15 | 1.39 |
| extend  | 0.55 | 0.92 | 1.14 |
| refresh | 0.16 | 0.22 | 0.31 |
| wait    | 0.01 | 0.01 | 0.03 |

Fixes: #85

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
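As a rough illustration of what a dispatch facility can look like, here is a minimal executor sketch using a plain queue and worker threads; it is an assumption for explanation only, not the vdsm periodic executor.

```python
import queue
import threading


class ExecutorSketch:
    """Workers run one-off calls submitted via dispatch() as soon as a
    worker is free, outside of any periodic schedule."""

    def __init__(self, workers=2):
        self._calls = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def dispatch(self, func):
        # Queue a one-off call; it runs on a worker thread, not on the
        # caller's (e.g. libvirt events) thread.
        self._calls.put(func)

    def _run(self):
        while True:
            func = self._calls.get()
            try:
                func()
            except Exception:
                pass  # a real executor would log the failure
```

An event handler can then call something like executor.dispatch(lambda: vm.monitor_volumes(urgent=True)) without blocking the thread that delivered the event.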
In #82 we improved the defaults to avoid pauses during volume extension, but
we have room for more improvements.
The extend flow contains several steps:
- A block threshold event signals that a drive needs extension.
- When the volume monitor finds that a drive needs extension, it sends a request to the SPM by writing to the storage mailbox.
- The SPM performs the extend using the SPM mailbox thread pool.
- When the host detects the reply, it completes the extend on the host side and resumes the VM if needed.
When we look at the logs, we see that we waited 1.93 seconds from the time the event was received until we handled it. A guest using fast storage can write 1.5 GiB during this wait and pause with ENOSPC. This was 46% of the total extend time.

The wait is caused by the monitor interval (2 seconds). If the periodic executor is blocked on slow libvirt calls, it can take much more time. When the executor runs the periodic watermark monitor, it checks all the VMs, and it can get blocked on another unresponsive VM, even if we are lucky and it runs quickly after we received the event.
What we want to do is handle the block threshold event immediately, avoiding the 0-2 second delay (or more in bad cases). But we don't want to do this on the libvirt events thread, since handling a block threshold event accesses libvirt and the storage layer, and it may block for a long time in bad cases, delaying other libvirt events.

I think the best way to handle it is to dispatch a call to VM.monitor_volumes() on the periodic executor when we receive an event.
Changes needed:
- Add a periodic.dispatch() API for running operations immediately, outside of the normal monitoring schedule.
- Use periodic.dispatch() in VolumeMonitor.on_block_threshold() to schedule a call to VM.monitor_volumes() soon.
- Add locking in VolumeMonitor.monitor_volumes() to ensure that we don't have multiple threads monitoring at the same time.
Once we have this, we can increase the monitoring interval, since it is needed only as a backup in case an extend request did not finish.
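A minimal sketch of the three changes listed above, assuming a hypothetical periodic.dispatch() callable and a simplified VM object; names and signatures are illustrative, not the actual vdsm code.

```python
import threading


class VolumeMonitor:
    def __init__(self, vm, dispatch):
        self._vm = vm
        self._dispatch = dispatch      # e.g. the proposed periodic.dispatch()
        self._lock = threading.Lock()

    def on_block_threshold(self, drive):
        # Called on the libvirt events thread: do not touch libvirt or the
        # storage layer here, just schedule the work to run immediately.
        # 'drive' identifies which drive crossed its threshold (unused here).
        self._dispatch(self.monitor_volumes)

    def monitor_volumes(self):
        # Locking ensures the dispatched run and the periodic run do not
        # monitor the same VM's volumes at the same time.
        if not self._lock.acquire(blocking=False):
            return  # another thread is already monitoring this VM
        try:
            self._vm.extend_drives_if_needed()
        finally:
            self._lock.release()
```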