helium iceberg handle retry on commit conflict during the wap workflow #1168
/// If the table has no snapshots yet, this is a no-op — the branch ref
/// will be created by the first `commit_to_branch` call instead.
///
/// Retries with exponential backoff on commit conflicts.
Do we want to consider using backon here?
We can break out the actual branch creation into something like `do_create_branch`; then this function is only about retrying and error handling.
```rust
pub(crate) async fn create_branch(
    catalog: &Catalog,
    table: &RwLock<Table>,
    branch_name: &str,
) -> Result<()> {
    validate_branch_name(branch_name)?;
    use backon::{ExponentialBuilder, Retryable};
    (|| do_create_branch(catalog, table, branch_name))
        .retry(
            ExponentialBuilder::default()
                .with_min_delay(COMMIT_RETRY_BASE_DELAY)
                .with_max_delay(COMMIT_RETRY_MAX_DELAY)
                .with_max_times(COMMIT_MAX_RETRIES),
        )
        .when(|err| err.is_commit_conflict())
        .notify(|err, dur| {
            tracing::warn!(
                ?err,
                branch_name,
                delay_ms = dur.as_millis() as u64,
                "commit conflict, retrying create_branch"
            )
        })
        .await
}
```

- StatusCode::CONFLICT => Err(Error::Catalog(
-     "commit conflict: one or more requirements failed".into(),
+ StatusCode::CONFLICT => Err(Error::CommitConflict(
+     "one or more requirements failed".into(),
Should we be putting the body of the response in these errors, so we can start to differentiate between the different types of conflicts?
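
One way the suggestion above could look, as a rough sketch rather than what this PR implements: the conflict variant keeps the raw response body and a coarse kind derived from it. The `ConflictKind` enum, the struct-style `CommitConflict` variant, the `error_from_response` helper, and the matched substrings are all assumptions made for illustration; only the `Catalog` and `CommitConflict` variant names and `is_commit_conflict` come from the diff.

```rust
/// Hypothetical classification of a conflict, derived from the response body.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConflictKind {
    /// The catalog rejected a snapshot with a stale sequence number.
    StaleSequenceNumber,
    /// A commit requirement (for example a ref assertion) failed.
    RequirementFailed,
    /// The body did not match any pattern we know about.
    Unknown,
}

/// Simplified stand-in for the crate's error type (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Error {
    Catalog(String),
    /// Keeps the raw body so callers and logs can tell conflicts apart.
    CommitConflict { kind: ConflictKind, body: String },
}

impl Error {
    pub fn is_commit_conflict(&self) -> bool {
        matches!(self, Error::CommitConflict { .. })
    }
}

/// Guess the conflict kind from the body text (the substrings are assumptions).
fn conflict_kind(body: &str) -> ConflictKind {
    let lower = body.to_lowercase();
    if lower.contains("sequence number") {
        ConflictKind::StaleSequenceNumber
    } else if lower.contains("requirement") {
        ConflictKind::RequirementFailed
    } else {
        ConflictKind::Unknown
    }
}

/// Map an HTTP status code and response body to an error.
pub fn error_from_response(status: u16, body: &str) -> Error {
    match status {
        // 409 is always treated as a conflict, whatever the body says.
        409 => Error::CommitConflict {
            kind: conflict_kind(body),
            body: body.to_string(),
        },
        // 400 only counts as a conflict when the body looks like one.
        400 => match conflict_kind(body) {
            ConflictKind::Unknown => Error::Catalog(format!("bad request: {}", body)),
            kind => Error::CommitConflict {
                kind,
                body: body.to_string(),
            },
        },
        _ => Error::Catalog(format!("unexpected status {}: {}", status, body)),
    }
}
```

With this shape, a `.when(|err| err.is_commit_conflict())` predicate keeps working unchanged, and a later change could retry only some kinds.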
pub(crate) const WAP_ENABLED_PROPERTY: &str = "write.wap.enabled";
pub(crate) const WAP_ID_KEY: &str = "wap.id";

fn commit_backoff() -> ExponentialBuilder {
It is nowhere near necessary here, I just thought it was interesting to point out: all the methods on ExponentialBuilder are const.
    },
];

commit(catalog, table_guard.identifier(), updates, requirements).await
If we move the backon retry stuff into commit, we get retries for all commits on conflict, and our other functions don't need to be as aware of that error.
I am open to refactor around this in the future, but not sure it is worth it at the moment.
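
For reference if that refactor is picked up later, here is a minimal sketch of what centralizing the retry could look like. It is deliberately synchronous and std-only so it stands alone; the real code is async and would more likely use backon as in the earlier suggestion. `commit_with_retry` and the simplified `Error` enum are hypothetical names, and the closure is assumed to reload table metadata itself on each attempt.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Simplified stand-in for the crate's error type (assumed shape).
#[derive(Debug, PartialEq)]
pub enum Error {
    Catalog(String),
    CommitConflict(String),
}

impl Error {
    pub fn is_commit_conflict(&self) -> bool {
        matches!(self, Error::CommitConflict(_))
    }
}

/// Run `attempt` and retry it only when it fails with a commit conflict.
///
/// The delay starts at `base_delay`, doubles after every retry, and is
/// capped at `max_delay`. Any other error, or a conflict after
/// `max_retries` retries, is returned to the caller unchanged.
pub fn commit_with_retry<T>(
    max_retries: usize,
    base_delay: Duration,
    max_delay: Duration,
    mut attempt: impl FnMut() -> Result<T, Error>,
) -> Result<T, Error> {
    let mut delay = base_delay;
    let mut retries = 0;
    loop {
        match attempt() {
            Err(err) if err.is_commit_conflict() && retries < max_retries => {
                // Conflict: wait, grow the delay, and try again.
                sleep(delay);
                delay = (delay * 2).min(max_delay);
                retries += 1;
            }
            // Success, a non-conflict error, or retries exhausted.
            other => return other,
        }
    }
}
```

Callers such as `create_branch` would then pass their commit as the closure and would not need to know about the conflict variant at all.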
When the mobile-packet-verifier daemon and backfill run concurrently, they race on the same Iceberg table's sequence number and snapshot refs, causing "Cannot add snapshot with sequence number N older than last sequence number N" errors (returned as 400 from Polaris).

- Add a `CommitConflict` error variant that detects conflict responses: both HTTP 409 and 400s with sequence number/requirement failure messages.
- Add retry with exponential backoff (100ms base, 5s max, 4 retries) directly inside `create_branch`, `commit_to_branch`, and `publish_branch` in `branch.rs`; each reloads fresh table metadata before retrying.
- `iceberg_table.rs` callers are now simple direct calls with no retry wrappers.
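
To make the backoff numbers above concrete, the sketch below computes the delay before each retry. The constant names are the ones referenced in the review suggestion and the values are the ones stated in this description; the doubling factor and the absence of jitter are assumptions, and `backoff_schedule` is a hypothetical helper, not part of the PR.

```rust
use std::time::Duration;

// Values from the PR description: 100ms base, 5s max, 4 retries.
pub const COMMIT_RETRY_BASE_DELAY: Duration = Duration::from_millis(100);
pub const COMMIT_RETRY_MAX_DELAY: Duration = Duration::from_secs(5);
pub const COMMIT_MAX_RETRIES: usize = 4;

/// Delay to wait before each retry, assuming the delay doubles every
/// time (factor 2, no jitter) and is capped at `max`.
pub fn backoff_schedule(base: Duration, max: Duration, retries: usize) -> Vec<Duration> {
    let mut delays = Vec::with_capacity(retries);
    let mut delay = base;
    for _ in 0..retries {
        delays.push(delay.min(max));
        // saturating_mul avoids overflow for very large retry counts.
        delay = delay.saturating_mul(2);
    }
    delays
}
```

Under those assumptions the four retries wait 100ms, 200ms, 400ms, and 800ms, so a commit that conflicts every time gives up after about 1.5s of waiting, and the 5s cap is never reached with these settings.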
Files changed
for testability
and backoff helper