config: Increase thin provisioning threshold #82

Merged
merged 4 commits into from Feb 23, 2022

Conversation

nirs
Member

@nirs nirs commented Feb 22, 2022

Collecting extend stats shows that extend takes between 2.2 and 6.2
seconds, with an average of 3.7 seconds. With the default thresholds:

[irs]
volume_utilization_chunk_mb = 1024
volume_utilization_percent = 50

This means that we extend the volume when free space is 512 MiB. Writing
more than 512 MiB in 3.7 seconds (138.4 MiB/s) will cause the VM to
pause with ENOSPC.

This configuration was too low 10 years ago, and we need to update it
for modern storage. Update the values to allow 4 times faster writes
before we pause with ENOSPC.

With the new configuration:

[irs]
volume_utilization_chunk_mb = 2560
volume_utilization_percent = 20

We extend the volume when free space is 2048 MiB.
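
As a quick cross-check, here is a small Python sketch (not vdsm code; the
formula is implied by the values quoted above) computing the free-space
threshold for each configuration and the highest write rate an average
3.7-second extend can absorb:

def extend_threshold_mib(chunk_mb, utilization_percent):
    # Free space (MiB) at which we extend: 1024/50% -> 512, 2560/20% -> 2048.
    return chunk_mb * (100 - utilization_percent) / 100

AVG_EXTEND_SECONDS = 3.7

for chunk_mb, percent in [(1024, 50), (2560, 20)]:
    free_mib = extend_threshold_mib(chunk_mb, percent)
    rate = free_mib / AVG_EXTEND_SECONDS
    print(f"chunk={chunk_mb} percent={percent}: extend at {free_mib:.0f} MiB "
          f"free, ~{rate:.0f} MiB/s sustainable")

The old values give ~138 MiB/s, matching the number above; the new values
give ~554 MiB/s, which is consistent with the pause-free rates in the
tables below.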

Testing with the old and new configuration shows that we can now cope with a
4x faster write rate before VMs pause during extend.

Before:

write rate  extends   pauses
----------------------------
 75 MiB/s        50        0
100 MiB/s        50        4
125 MiB/s        50        4
150 MiB/s        53       24

After:

write rate  extends   pauses
----------------------------
200 MiB/s        20        0
250 MiB/s        20        0
300 MiB/s        20        0
350 MiB/s        21        0
400 MiB/s        20        1
450 MiB/s        20        2
500 MiB/s        22        7
550 MiB/s        23        7

The downside of this change is allocating more space in the storage
domain. A new empty disk will consume 2.5 GiB instead of 1 GiB.

Bug-Url: https://bugzilla.redhat.com/2051997
Signed-off-by: Nir Soffer nsoffer@redhat.com

@nirs nirs added verified Change was tested; please describe how it was tested in the PR storage labels Feb 22, 2022
@nirs nirs requested a review from tinez as a code owner February 22, 2022 10:12
bennyz previously approved these changes Feb 22, 2022
@nirs
Member Author

nirs commented Feb 22, 2022

Some storage tests fail, I guess some tests wrongly assume the old chunk size instead of
using the mocked config.

@nirs nirs marked this pull request as draft February 22, 2022 10:57
@michalskrivanek
Member

can you elaborate a little bit on the factors affecting the time it takes to extend? I mean besides the actual storage speed. Are there any waits in the process, communication with the SPM, etc?

@oVirt oVirt deleted a comment from ovirt-infra Feb 22, 2022
@michalskrivanek
Member

also, is the write speed as perceived by the guest directly corresponding to the actual physical write speed? We're using O_DIRECT everywhere so I would assume yes? So we can roughly estimate the minimal values for a concrete underlying write speed? E.g. if we try to measure the write speed outside of oVirt with fio or even dd or something...

@nirs
Member Author

nirs commented Feb 22, 2022

can you elaborate a little bit on the factors affecting the time it takes to extend? I mean besides the actual storage speed. Are there any waits in the process, communication with the SPM, etc?

The time includes a lot of waiting since we use polling:

  • The libvirt event thread gets a block threshold event and marks the drive
    for extension.
  • The periodic watermark monitor checks VMs every 2 seconds. When it finds
    that a drive needs extension, it sends a request to the SPM by writing
    to the storage mailbox.
  • The SPM checks storage every 2 seconds. When it finds the request, it
    runs the extend using the SPM mailbox thread pool.
  • The host polls its mailbox for replies every 2 seconds. When it detects
    the reply, it completes the extend on the host side and resumes the VM if
    needed (the timing sketch below sums these intervals).
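
As a rough back-of-the-envelope sketch of these polling delays (illustrative
only; the per-extend work time below is an assumption, not a measured value):

POLL_INTERVALS = [2.0, 2.0, 2.0]  # watermark monitor, SPM mailbox check, host reply poll
EXTEND_WORK = 0.5                 # assumed time for the actual extend on the SPM

best = EXTEND_WORK                           # every poll happens to fire right away
worst = sum(POLL_INTERVALS) + EXTEND_WORK    # every poll just missed its tick
print(f"extend latency: ~{best:.1f}s best case, ~{worst:.1f}s worst case")

This is roughly in line with the 2.2-6.2 second range measured above.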

We cannot optimize the storage mailbox much, since checking the mailbox requires
reading from storage, and many hosts may check the mailbox at the same time.
Maybe we can check every 1 second instead of every 2.

We can optimize the block threshold event handling - it should really post an
event that wakes up the watermark monitor immediately and starts the extend
flow. This would save 0-2 seconds from the total time. It requires a rewrite
of the watermark monitor, separating it from the periodic executor, which
is a change I have wanted to do for a long time.
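
A minimal sketch of that idea (illustrative only, not vdsm code): the monitor
sleeps on an event with a timeout instead of a fixed period, so a block
threshold event can wake it immediately:

import threading

wakeup = threading.Event()
pending = []  # drives marked for extension

def on_block_threshold(drive):
    # Called from the libvirt event thread: mark the drive and wake the monitor.
    pending.append(drive)
    wakeup.set()

def watermark_monitor(stop, interval=2.0):
    while not stop.is_set():
        wakeup.wait(timeout=interval)  # returns early when an event arrives
        wakeup.clear()
        while pending:
            print(f"requesting extend for {pending.pop()}")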

Member

@vjuranek vjuranek left a comment


the linter says that from vdsm.config import config in blockvolume_test is unused now and fails

@nirs
Member Author

nirs commented Feb 22, 2022

also, is the write speed as perceived by the guest directly corresponding to the actual physical write speed? We're using O_DIRECT everywhere so I would assume yes? So we can roughly estimate the minimal values for a concrete underlying write speed? E.g. if we try to measure the write speed outside of oVirt with fio or even dd or something...

Yes, we use direct I/O, and the extend script is using direct I/O inside the
guest, so what we write inside the guest is exactly what is written to the
actual storage.

Measuring write throughput should be done in the guest; there is a big difference
between guest performance and host performance. Measuring is complicated, and it
is impossible to predict how fast the disk will grow on a given system.
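
For illustration, a minimal sketch (assuming a Linux guest and a scratch path
you can write to; not part of vdsm or the extend script) of measuring
direct-I/O write throughput inside the guest, which is the rate that matters
for how fast the thin disk grows:

import mmap
import os
import time

PATH = "/var/tmp/write-test"   # hypothetical scratch file
BLOCK = 1024 * 1024            # 1 MiB per write
COUNT = 1024                   # 1 GiB in total

buf = mmap.mmap(-1, BLOCK)     # page-aligned buffer, required for O_DIRECT
buf.write(b"\x55" * BLOCK)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    start = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, buf)
    os.fsync(fd)
    elapsed = time.monotonic() - start
finally:
    os.close(fd)
    os.unlink(PATH)

print(f"{COUNT * BLOCK / (1024 ** 2) / elapsed:.1f} MiB/s")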

@nirs
Member Author

nirs commented Feb 22, 2022

the linter says that from vdsm.config import config in blockvolume_test is unused now and fails

flake8 is correct, fixed in the current version.

@michalskrivanek
Member

Measuring write throughput should be done in the guest; there is a big difference
between guest performance and host performance. Measuring is complicated, and it
is impossible to predict how fast the disk will grow on a given system.

yes, but isn't it just that it's always somewhat slower in the guest, never faster? In that case it is still useful, since we can then measure easily on the host and estimate for the worst case of near-host performance in the guest

@nirs
Member Author

nirs commented Feb 22, 2022

Measuring write throughput should be done in the guest; there is a big difference
between guest performance and host performance. Measuring is complicated, and it
is impossible to predict how fast the disk will grow on a given system.

yes, but isn't it just that it's always somewhat slower in the guest, never faster? In that case it is still useful, since we can then measure easily on the host and estimate for the worst case of near-host performance in the guest

I don't see how it can be faster in the guest. But it is not possible to measure on the host
and assume that the measurement will still hold later, because the load on the storage,
the network and the host at the time of the measurement can be different from the load at
the time the guest uses the storage.

@nirs nirs marked this pull request as ready for review February 22, 2022 16:16
@nirs
Member Author

nirs commented Feb 22, 2022

@vjuranek should be ready now.

@nirs nirs requested a review from bennyz February 22, 2022 16:17
Tests depending on configuration options must use a mock config object to
avoid failing when the configuration is modified, or when running on a host
with a non-default config.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
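
A minimal sketch of that pattern (illustrative only; the helper and names are
hypothetical, not the actual vdsm testlib API): the test reads the threshold
from a mocked config object instead of hard-coding the default value:

from unittest import mock

def chunk_size_mb(cfg):
    # Hypothetical production helper: reads the chunk size from the config.
    return cfg.getint("irs", "volume_utilization_chunk_mb")

def test_extend_uses_configured_chunk():
    cfg = mock.Mock()
    cfg.getint.return_value = 2560   # simulate a host with a non-default config
    assert chunk_size_mb(cfg) == 2560
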
The set threshold test was using a hard-coded value assuming the old vdsm
configuration. Change the test to use the vdsm configuration so it does not
break when the configuration is changed.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
This horrible test was depending on the default configuration instead of
using the config, and is written in a way that makes it hard to use the
config.

The horrible make_env() context manager using the crappy storagetestlib
was mocking everything only after creating the volumes, but the volumes need
the mocking in place to consider the configuration. Change the test to create
everything inside the mock context.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Collecting extend stats shows that extend takes between 2.2 and 6.2
seconds, with an average of 3.7 seconds. With the default thresholds:

[irs]
volume_utilization_chunk_mb = 1024
volume_utilization_percent = 50

This means that we extend the volume when free space is 512 MiB. Writing
more than 512 MiB in 3.7 seconds (138.4 MiB/s) will cause the VM to
pause with ENOSPC.

This configuration was too low 10 years ago, and we need to update it
for modern storage. Update the values to allow 4 times faster writes
before we pause with ENOSPC.

With the new configuration:

[irs]
volume_utilization_chunk_mb = 2560
volume_utilization_percent = 20

We extend the volume when free space is 2048 MiB.

Testing with the old and new configuration shows that we can now cope with a
4x faster write rate before VMs pause during extend.

Before:

write rate  extends   pauses
----------------------------
 75 MiB/s        50        0
100 MiB/s        50        4
125 MiB/s        50        4
150 MiB/s        53       24

After:

write rate  extends   pauses
----------------------------
200 MiB/s        20        0
250 MiB/s        20        0
300 MiB/s        20        0
350 MiB/s        21        0
400 MiB/s        20        1
450 MiB/s        20        2
500 MiB/s        22        7
550 MiB/s        23        7

The downside of this change is allocating more space in the storage
domain. A new empty disk will consume 2.5 GiB instead of 1 GiB.

Bug-Url: https://bugzilla.redhat.com/2051997
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@sonarcloud

sonarcloud bot commented Feb 22, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

No Coverage information
Duplication: 0.0%

@michalskrivanek
Member

4x is a big improvement.
The extra overhead is small enough; 1.5 GiB extra for today's SD sizes is worth it.

Still, with a bit over 6 seconds it means we will still eventually pause with >300 MiB/s writes even in quiet conditions. Is it high enough? Should we go for 500, increasing the overhead? OTOH it's really not a big deal to pause occasionally; we just want to avoid frequent pauses every time e.g. etcd is extended.

@mykaul FYI
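
A quick check of that figure, using the 2048 MiB threshold from the
description and the worst-case extend time:

print(f"{2048 / 6.2:.0f} MiB/s")  # ~330 MiB/s of headroom during a ~6.2s extend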

@mykaul

mykaul commented Feb 22, 2022

4x is a big improvement. The extra overhead is small enough; 1.5 GiB extra for today's SD sizes is worth it.

Still, with a bit over 6 seconds it means we will still eventually pause with >300 MiB/s writes even in quiet conditions. Is it high enough? Should we go for 500, increasing the overhead? OTOH it's really not a big deal to pause occasionally; we just want to avoid frequent pauses every time e.g. etcd is extended.

@mykaul FYI

Looks good - time to change the defaults indeed.

@vjuranek vjuranek merged commit a90335c into oVirt:master Feb 23, 2022
@nirs nirs deleted the thinp-defaults branch March 20, 2022 12:16