Skip to content

Commit

Permalink
DLPX-81891 agent panicked in reclaim_frees_object(): blocks_size vs n…
Browse files Browse the repository at this point in the history
…um_bytes mismatch (openzfs#605)

Whenever we do a PutObject, we use a timeout of `PER_REQUEST_TIMEOUT` which is
2 seconds.  The timeout is implemented via `tokio::time::timeout()`, which
drops the Future if the timeout expires.  See `object_access::retry()`.  When
this happens, we don't have visibility into the internal state of S3.  In
particular, we don't know if the PutObject was ignored, actually completed
(e.g. we timed out just as it was sending a response to us), or if it will
complete at some point of time in the future.

The problem occurs when a later PutObject (e.g. due to reclaiming freed blocks)
overwrites the same object with different data, and the first, timed-out
PutObject is applied at a later time, overwriting the second PutObject's data.
This looks like the second PutObject had no effect, but in fact it did take
effect, but was later overwritten by the first, timed-out PutObject.

If the overwritten data contains blocks from a different (consolidated) object,
those blocks will be lost.  To mitigate the problem, in this commit we disable
object consolidation.  In other words, when reclaiming freed space, a given
object may be overwritten only with a subset of its original blocks, with no
additional blocks added.  This way, if a timed-out PutObject takes effect
later, the reverted state of the object only contains additional blocks (which
will be leaked, at least until the entire object is deleted).
  • Loading branch information
ahrens committed Sep 8, 2022
1 parent 6f555f2 commit a18cf9e
Showing 1 changed file with 33 additions and 5 deletions.
38 changes: 33 additions & 5 deletions cmd/zfs_object_agent/zettaobject/src/reclaim.rs
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,8 @@ tunable! {
// When reclaiming free blocks, allow this many concurrent GetObject+PutObject requests.
static ref RECLAIM_QUEUE_DEPTH: usize = 200;

static ref VERIFY_RECLAIM_PUT: bool = true;
static ref VERIFY_RECLAIM_PUT: bool = false;
static ref MITIGATE_OVERWRITE_BUG: bool = true;
}

lazy_static! {
Expand Down Expand Up @@ -698,6 +699,18 @@ impl Reclaim {
freed_blocks_count += u64::from(later_blocks_freed);
freed_blocks_bytes += u64::from(later_bytes_freed);
objects_to_consolidate.push((later_object_new_size, frees));
if *MITIGATE_OVERWRITE_BUG {
// A timed-out PutObject may take effect at an unknown later time. If a
// subsequent PutObject overwrites the same object with different data,
// the first PutObject may be applied at a later time, overwriting the
// second PutObject's data. If the overwritten data contains blocks from
// a different (consolidated) object, those blocks will be lost. To
// mitigate this, we do not consolidate objects. This way, if a
// timed-out PutObject takes effect later, the reverted state of the
// object only contains additional blocks (which will be leaked, at least
// until the entire object is delted).
break;
}
}
// XXX look for earlier objects too?

Expand Down Expand Up @@ -869,15 +882,30 @@ async fn reclaim_frees_object(
.blocks
.retain(|block, _| block < &next_block);

assert_ge!(phys.header.blocks_size, object_size.num_bytes);
if !*MITIGATE_OVERWRITE_BUG {
assert_ge!(phys.header.blocks_size, object_size.num_bytes);
}
phys.header.blocks_size = object_size.num_bytes;

assert_ge!(phys.header.next_block, next_block);
phys.header.next_block = next_block;
}
assert_eq!(phys.header.blocks_size, phys.calculate_blocks_size());
assert_eq!(phys.header.blocks_size, object_size.num_bytes);
assert_eq!(phys.blocks_len(), object_size.num_blocks);
if *MITIGATE_OVERWRITE_BUG {
// A timed-out and then later-applied PutObject may have reverted the object to a
// previous state which has more blocks than we expect.
if phys.header.blocks_size != object_size.num_bytes ||
phys.blocks_len() != object_size.num_blocks
{
warn!("reclaim: {object:?} expected size {} blocks {}, found size {} blocks {}",
object_size.num_bytes, object_size.num_blocks,
phys.header.blocks_size, phys.blocks_len());
}
phys.header.blocks_size = phys.calculate_blocks_size();
} else {
assert_eq!(phys.header.blocks_size, phys.calculate_blocks_size());
assert_eq!(phys.header.blocks_size, object_size.num_bytes);
assert_eq!(phys.blocks_len(), object_size.num_blocks);
}

phys
})
Expand Down

0 comments on commit a18cf9e

Please sign in to comment.