Skip to content

helium iceberg handle retry on commit conflict during the wap workflow#1168

Merged
bbalser merged 5 commits into
mainfrom
bbalser/iceberg-wap-retry
Apr 7, 2026
Merged

helium iceberg handle retry on commit conflict during the wap workflow#1168
bbalser merged 5 commits into
mainfrom
bbalser/iceberg-wap-retry

Conversation

@bbalser
Copy link
Copy Markdown
Collaborator

@bbalser bbalser commented Apr 1, 2026

  • When the mobile-packet-verifier daemon and backfill run concurrently, they race on the same Iceberg table's
    sequence number and snapshot refs, causing "Cannot add snapshot with sequence number N older than last
    sequence number N" errors (returned as 400 from Polaris)

  • Add CommitConflict error variant that detects conflict responses — both HTTP 409 and 400s with sequence
    number/requirement failure messages

  • Add retry with exponential backoff (100ms base, 5s max, 4 retries) directly inside create_branch,
    commit_to_branch, and publish_branch in branch.rs — each reloads fresh table metadata before retrying

  • iceberg_table.rs callers are now simple direct calls with no retry wrappers

    Files changed

    • error.rs — CommitConflict variant + is_commit_conflict() helper
    • rest_endpoint.rs — Return CommitConflict for 409 and conflict-indicating 400s; extracted is_conflict_body()
      for testability
    • branch.rs — Retry with metadata reload in create_branch, commit_to_branch, publish_branch; retry constants
      and backoff helper
    • iceberg_table.rs — Simplified to direct branch calls; reload_table made pub(crate)

@bbalser bbalser requested a review from michaeldjeffrey April 1, 2026 19:08
@bbalser bbalser marked this pull request as ready for review April 1, 2026 19:08
/// If the table has no snapshots yet, this is a no-op — the branch ref
/// will be created by the first `commit_to_branch` call instead.
///
/// Retries with exponential backoff on commit conflicts.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to consider using backon here?

We can break out the actual branch creating to something like do_create_branch, then this function becomes about the retrying and error handling.

pub(crate) async fn create_branch(
    catalog: &Catalog,
    table: &RwLock<Table>,
    branch_name: &str,
) -> Result<()> {
    validate_branch_name(branch_name)?;

    use backon::{ExponentialBuilder, Retryable};

    (|| do_create_branch(catalog, table, branch_name))
        .retry(
            ExponentialBuilder::default()
                .with_min_delay(COMMIT_RETRY_BASE_DELAY)
                .with_max_delay(COMMIT_RETRY_MAX_DELAY)
                .with_max_times(COMMIT_MAX_RETRIES),
        )
        .when(|err| err.is_commit_conflict())
        .notify(|err, dur| {
            tracing::warn!(
                ?err,
                branch_name,
                delay_ms = dur.as_millis() as u64,
                "commit conflict, retrying create_branch"
            )
        })
        .await
}```

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated 906a081

StatusCode::CONFLICT => Err(Error::Catalog(
"commit conflict: one or more requirements failed".into(),
StatusCode::CONFLICT => Err(Error::CommitConflict(
"one or more requirements failed".into(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be putting the body of the response in these errors? So we can start to differentiate between the different types of conflicts?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, updated 4bfaef7

pub(crate) const WAP_ENABLED_PROPERTY: &str = "write.wap.enabled";
pub(crate) const WAP_ID_KEY: &str = "wap.id";

fn commit_backoff() -> ExponentialBuilder {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is no where near necessary here, just thought it was interesting to point out. All the method on ExpontentialBuilder are const.

},
];

commit(catalog, table_guard.identifier(), updates, requirements).await
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move the backon retry stuff into commit, we get retries for all commits on conflict, and our other functions don't need to be as aware of that error.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to refactor around this in the future, but not sure it is worth it at the moment.

@bbalser bbalser merged commit 1ccb57d into main Apr 7, 2026
29 checks passed
@bbalser bbalser deleted the bbalser/iceberg-wap-retry branch April 7, 2026 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants