Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP; RFC, discussion, feedback] account for ashift when choosing buffers … #3451

Conversation

kernelOfTruth
Copy link
Contributor

…to be written to l2arc device

If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual increment
to l2ad_hand was introduced in
commit e14bb3258d05c1b1077e2db7cf77088924e56919

Also, consistently use asize / a_sz for the allocated size, psize / p_sz
for the physical size. Where the latter accounts for possible size
reduction because of compression, whereas the former accounts for possible
size expansion because of alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size. This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical I/O
has a suitable size.

References:
https://reviews.csiden.org/r/112/diff/1/#index_header
#3433

Ported-by:
kernelOfTruth kernelOfTruth@gmail.com

Fixes (supposedly):
zfsonlinux#3114
zfsonlinux#3400
[and potentially several others]

…to be written to l2arc device

If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual increment
to l2ad_hand was introduced in
commit e14bb3258d05c1b1077e2db7cf77088924e56919

Also, consistently use asize / a_sz for the allocated size, psize / p_sz
for the physical size.  Where the latter accounts for possible size
reduction because of compression, whereas the former accounts for possible
size expansion because of alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical I/O
has a suitable size.
@kernelOfTruth
Copy link
Contributor Author

@avg-I your updated patch reaped some great results:

#3400 (comment)

#3400 (comment)

Thanks 👍

@pzwahlen
Copy link

@kernelOfTruth Would it be possible to update this patch against current master ? I'm willing to run that stack for a few days...

[mockbuild@dev-c7 gitzfs]$ patch -p1 < 3451.patch  
patching file module/zfs/arc.c
Hunk #1 succeeded at 5776 with fuzz 1 (offset 755 lines).
Hunk #2 FAILED at 5050.
Hunk #3 FAILED at 5095.
Hunk #4 succeeded at 5865 with fuzz 1 (offset 755 lines).
Hunk #5 FAILED at 5132.
Hunk #6 succeeded at 5930 (offset 756 lines).
Hunk #7 succeeded at 5941 (offset 756 lines).
Hunk #8 succeeded at 5958 (offset 756 lines).
3 out of 8 hunks FAILED -- saving rejects to file module/zfs/arc.c.rej

Thanks again for that great work...

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 12, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 13, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 13, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 14, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

Account for the error message:
error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
uint64_t stats_size = 0;

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 14, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

Account for the error message:
error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
uint64_t stats_size = 0;

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
@avg-I
Copy link
Contributor

avg-I commented Jun 16, 2015

BTW, I've updated the patch against the latest illumos code: https://reviews.csiden.org/r/229/

@avg-I
Copy link
Contributor

avg-I commented Jun 16, 2015

Ah, I've missed #3491 - thanks!

@kernelOfTruth
Copy link
Contributor Author

@pzwahlen the updated fix is at #3491

@avg-I seems like I need to update the commit message again =)

Thank you for your hard work !

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 24, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

Account for the error message:
error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
uint64_t stats_size = 0;

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 24, 2015
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3400
Issue openzfs#3433
Issue openzfs#3451
@kernelOfTruth
Copy link
Contributor Author

closing, superseded by #3521

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 25, 2015
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3400
Closes openzfs#3433
Closes openzfs#3451
dweeezil pushed a commit to dweeezil/zfs that referenced this pull request Aug 20, 2015
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3400
Closes openzfs#3433
Closes openzfs#3451

Cherry-pick ported by: Tim Chase <tim@chase2k.com>
Cherry-picked from ef56b07
dweeezil pushed a commit to dweeezil/zfs that referenced this pull request Aug 20, 2015
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
https://reviews.csiden.org/r/229/
https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3400
Closes openzfs#3433
Closes openzfs#3451

Ported-to-0.6.4.2-by: Tim Chase <tim@chase2k.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants