DLPX-87575 agent panic: request has timed out (openzfs#1160)
When the agent restarts, it uses ListObjects operations to find any objects
that are part of the in-progress TXG.  If there are many such objects, this can
take a while, resulting in the following agent panic:

```
thread 'zoa' panicked at 'called `Result::unwrap()` on an `Err` value: request has timed out

Caused by:
    operation attempt timeout (single attempt) occurred after 2s', zettaobject/src/object_access/mod.rs:491:42
stack backtrace:
...
      zettaobject::object_access::ObjectAccess::try_list_after::{{closure}}
             at zfs/cmd/zfs_object_agent/zettaobject/src/object_access/mod.rs:491:35
...
   8: zettaobject::pool::recover_list::{{closure}}::{{closure}}::{{closure}}
             at zfs/cmd/zfs_object_agent/zettaobject/src/pool.rs:1928:17
```

The problem is that we configure the “retrying” client of the AWS SDK with a
2-second per-attempt timeout.  (This “retrying” client is only used for
ListObjects requests.)  When an attempt exceeds this timeout, the SDK returns an
error.  However, in this case the request may reasonably take longer, and we
want to keep waiting.

The fix is to not configure a timeout on the “retrying” client.
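
The effect of the timeout can be illustrated in miniature with the standard
library alone.  This is only a sketch, not the agent's code: `slow_list` and
`attempt` are hypothetical stand-ins for a long-running ListObjects request and
a single SDK attempt, and the error string merely imitates the panic message
above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a ListObjects request that takes `latency`.
fn slow_list(latency: Duration) -> Vec<String> {
    thread::sleep(latency);
    vec!["obj-1".to_string(), "obj-2".to_string()]
}

/// Runs one attempt on a helper thread.  With `Some(limit)` the caller gives
/// up once `limit` elapses; with `None` it waits until the request finishes.
fn attempt(latency: Duration, timeout: Option<Duration>) -> Result<Vec<String>, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller has already given up.
        let _ = tx.send(slow_list(latency));
    });
    match timeout {
        Some(limit) => rx
            .recv_timeout(limit)
            .map_err(|_| "request has timed out".to_string()),
        None => rx.recv().map_err(|e| e.to_string()),
    }
}
```

Calling `attempt(Duration::from_millis(50), Some(Duration::from_millis(10)))`
returns an `Err`, because the request legitimately needs longer than the limit,
while the same call with `None` returns both objects.  Removing the timeout
configuration corresponds to the `None` case.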
ahrens committed Aug 28, 2023
1 parent d022a59 commit 79ae554
Showing 1 changed file with 0 additions and 5 deletions.
5 changes: 0 additions & 5 deletions cmd/zfs_object_agent/zettaobject/src/object_access/s3.rs
@@ -204,11 +204,6 @@ impl S3ObjectAccess {
 .with_max_attempts(u32::MAX)
 .with_initial_backoff(Duration::from_millis(100)),
 )
-.timeout_config(
-TimeoutConfig::builder()
-.operation_attempt_timeout(*PER_REQUEST_TIMEOUT)
-.build(),
-)
 .load()
 .await;
 // Force path style URLs, which are required for on-prem object stores.
