Hi google friends! Apologies for the trouble!! We found this interesting issue when using tensorstore!! Thanks for taking a look
Summary
When an S3 PUT request times out server-side, the TensorStore client can hang for ~30 minutes before retrying. During this window, S3 access logs show zero retries: the request appears to be "in flight" on the client but is dead on the server. The S3 access logs confirm that the server received the original request.
The ~30-minute gap matches the default OS-level TCP keepalive + retransmission timeout, which is the only thing that eventually forces the stalled socket closed.
Root Cause
Two layers of missing timeout configuration compound each other:
1. S3 driver issues requests with no timeouts (s3_key_value_store.cc:764)
auto future = owner->transport_->IssueRequest(
request, internal_http::IssueRequestOptions(value_));
// IssueRequestOptions only sets payload; request_timeout and connect_timeout
// remain at their defaults of absl::ZeroDuration().
2. Zero duration → curl sets nothing (curl_transport.cc:198-205)
if (options.request_timeout > absl::ZeroDuration()) { // never true
handle_.SetOption(CURLOPT_TIMEOUT_MS, ...);
}
CURLOPT_TIMEOUT_MS is never set, so libcurl has no deadline on the transfer.
Workaround
Set environment variables before launching your process:
export TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=30
export TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1
This enables the low-speed watchdog: if fewer than 1 byte/sec is transferred for 30 seconds, libcurl aborts the request and the retry logic takes over.
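To make the watchdog behaviour described above concrete, here is a rough model of the rule in plain C++. It is a simplified sketch, not libcurl's actual implementation: it assumes one throughput sample per second, and `FirstAbortSecond` is a hypothetical name used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Simplified model of a low-speed watchdog (not libcurl's real code).
// bytes_per_second[i] is the number of bytes transferred during second i.
// Returns the index of the second at which the transfer would be aborted,
// or -1 if the transfer never stays below the limit for long enough.
int FirstAbortSecond(const std::vector<std::size_t>& bytes_per_second,
                     std::size_t limit_bytes, int time_seconds) {
  if (time_seconds <= 0) return -1;  // treat a non-positive window as "disabled"
  int slow_run = 0;                  // consecutive seconds below the limit
  for (std::size_t i = 0; i < bytes_per_second.size(); ++i) {
    if (bytes_per_second[i] < limit_bytes) {
      if (++slow_run >= time_seconds) return static_cast<int>(i);
    } else {
      slow_run = 0;  // any second at or above the limit resets the window
    }
  }
  return -1;
}
```

With a limit of 1 byte and a window of 30 seconds, a transfer that stops sending data is flagged after 30 consecutive idle seconds, while a transfer that trickles even a single byte inside the window is left alone.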
Suggested fix
- Expose the request timeout and connect timeout as configurable options in the S3 driver
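One possible shape for this, sketched with `std::chrono` stand-ins rather than the real TensorStore types. Everything here is hypothetical (`S3TimeoutSpec`, `RequestTimeouts`, `ResolveTimeouts`), and the fallback-to-default behaviour is an assumption on our side, not something the current code does.

```cpp
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical driver-level settings; zero means "not configured by the user".
struct S3TimeoutSpec {
  Ms request_timeout{0};
  Ms connect_timeout{0};
};

// Hypothetical per-request values that would be handed to the HTTP transport.
struct RequestTimeouts {
  Ms request_timeout{0};
  Ms connect_timeout{0};
};

// Picks the user's value when it is positive, otherwise the fallback, so the
// transport never receives a zero (unset) timeout unless the fallback is zero.
RequestTimeouts ResolveTimeouts(const S3TimeoutSpec& spec,
                                const RequestTimeouts& fallback) {
  RequestTimeouts out;
  out.request_timeout = spec.request_timeout > Ms::zero()
                            ? spec.request_timeout
                            : fallback.request_timeout;
  out.connect_timeout = spec.connect_timeout > Ms::zero()
                            ? spec.connect_timeout
                            : fallback.connect_timeout;
  return out;
}
```

The point of the sketch is only that a non-zero value reaches the transport, so the `request_timeout > absl::ZeroDuration()` check quoted above would pass and `CURLOPT_TIMEOUT_MS` would be set.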