Update RFC-0001 for archival locking and archival barrier semantics#9

Open
HumairAK wants to merge 2 commits into mlflow:main from HumairAK:trace_archival_coordination
Conversation

Contributor

@HumairAK HumairAK commented Apr 22, 2026

Summary

This PR updates RFC-0001 for trace archival to reflect the current design for both archival payload location semantics and archival/write coordination.

In particular, it:

  • keeps mlflow.artifactLocation reserved for the trace's ordinary artifact root, including attachments
  • introduces mlflow.trace.archiveLocation as the archive-specific location for archived traces.pb payloads
  • updates retrieval, deletion, and repository-ingestion semantics to consistently use the archive-specific location when SpansLocation=ARCHIVE_REPO
  • clarifies the existing artifact-backed trace path so the RFC distinguishes ordinary trace artifacts from archival payload storage
  • replaces the earlier transient archival-barrier approach with version-based coordination using a per-trace trace_version
  • documents atomic archival finalization semantics so concurrent DB-backed writes and archival cannot leave a trace in a mixed DB/archive state
  • clarifies crash recovery and retry behavior when archival is interrupted or loses a race with a concurrent write
  • updates the implementation-impact and migration notes to reflect the new archive-location tag and trace_version coordination model

Implementation context

Related implementation details:

7. **Finalize archival atomically:** Update `spans.content` to an empty string, set the trace tag `mlflow.trace.spansLocation` = `SpansLocation.ARCHIVE_REPO`, and clear the transient `mlflow.trace.archiving` tag.
8. **Clear the emergency tag:** If the archival was triggered by `archive now` and the emergency archival pass completes successfully, clear the tag so it behaves as a one-shot failsafe.

`log_spans()` must cooperate with this barrier. In particular, DB-backed span writes should check for blocked traces both before and after their batch upserts so a write that races with archival fails and rolls back instead of partially succeeding and then reasserting `TRACKING_STORE` after archival has begun.
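A minimal sketch of this double-check pattern, using an in-memory SQLite database. The table and column names (`trace_tags`, `spans`) are illustrative only, not MLflow's actual schema; the real store would run this inside its own transaction machinery:

```python
import sqlite3

# Hypothetical minimal schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace_tags (trace_id TEXT, key TEXT, value TEXT)")
conn.execute("CREATE TABLE spans (trace_id TEXT, span_id TEXT, content TEXT)")

def trace_blocked(conn, trace_id):
    # A trace is "blocked" while the transient archiving tag is present.
    row = conn.execute(
        "SELECT 1 FROM trace_tags WHERE trace_id = ? AND key = 'mlflow.trace.archiving'",
        (trace_id,),
    ).fetchone()
    return row is not None

def log_spans(conn, trace_id, spans):
    # Check once before the batch upsert...
    if trace_blocked(conn, trace_id):
        raise RuntimeError("trace is being archived; write rejected")
    conn.executemany(
        "INSERT INTO spans (trace_id, span_id, content) VALUES (?, ?, ?)",
        [(trace_id, s, c) for s, c in spans],
    )
    # ...and again after, so a write that raced with archival rolls back
    # instead of partially succeeding.
    if trace_blocked(conn, trace_id):
        conn.rollback()
        raise RuntimeError("archival started mid-write; rolled back")
    conn.commit()
```

A write that begins before the archiving tag is set but finishes after it is set fails the second check and rolls back, which is the case the single pre-check alone cannot catch.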
Collaborator

Could this be optimized so that the span insert fails by using a where clause on the archive in progress tag not being set? This would avoid the additional read.

Contributor Author

Unfortunately this wouldn't be enough to coordinate the writes. In the current SQL-backed implementation, we have to coordinate archival and log_spans() more broadly than just making the insert conditional. Once archival takes ownership of a trace, conflicting DB-backed writes are rejected so they can’t partially succeed and then reassert TRACKING_STORE.

The relevant code can be found here.

Collaborator

mprahl commented Apr 23, 2026

@HumairAK I'd suggest that future RFC updates should be less implementation detail heavy and be focused more on the decisions to be made.

You could have avoided my comment 😆 : #9 (comment)

Contributor Author

I agree. My original intention was just to draw attention to the fact that we'll need to add coordination to the existing log_spans(), but I think it went a bit too far into the weeds. I've updated the doc to focus more on the semantics of coordination; see the last commit.

@HumairAK HumairAK requested a review from mprahl April 24, 2026 13:25
@HumairAK HumairAK force-pushed the trace_archival_coordination branch from 598d41e to fa230c4 on April 24, 2026 20:45
Comment thread rfcs/0001-trace-archival/0001-trace-archival.md Outdated
```python
from alembic import op
import sqlalchemy as sa


def upgrade():
    with op.batch_alter_table("trace_info", schema=None) as batch_op:
        batch_op.add_column(sa.Column("trace_version", sa.BigInteger(), nullable=True))
```
Collaborator

This needs to be updated after the latest commit, right?

@HumairAK HumairAK force-pushed the trace_archival_coordination branch from bd35d41 to f5e6b65 on April 24, 2026 21:19
Collaborator

mprahl commented Apr 24, 2026

@B-Step62 could you please take a look at this?

Keep trace artifact roots and archival payload locations distinct so archived traces do not interfere with existing trace artifacts or attachments.
Comment thread rfcs/0001-trace-archival/0001-trace-archival.md Outdated
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.

The archival repository URI for archived traces (where to read `traces.pb`) continues to be recorded via the existing `mlflow.artifactLocation` tag. The `mlflow.trace.spansLocation` tag disambiguates how that URI is interpreted: when span data is in `ARTIFACT_REPO`, it points to the existing artifact-backed trace path; when span data is in `ARCHIVE_REPO`, it points to the archived file location in the archival repository for `traces.pb`. The **effective archival repository root** for a given trace is the workspace's `trace_archival_location` when set, otherwise the server's global `--trace-archival-location` (or default artifact root).
Add a `trace_version` column on `trace_info` (or an equivalent per-trace generation field) for this purpose, logically defaulting to `0`. This version is MLflow-internal coordination state for DB-backed traces and is not part of the archived OTLP `TracesData` payload. It tracks the generation of the DB-backed span payload, and any DB-backed trace state derived from `log_spans()` writes, while the trace is in `TRACKING_STORE`; unrelated metadata changes and archival's own `SpansLocation` transition do not need to advance it.
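The select-snapshot-finalize coordination described above could be sketched roughly as follows. This is a hedged illustration with an in-memory SQLite database; table and column names are hypothetical, and clearing span content is omitted. The key idea is that finalization is a single conditional UPDATE acting as a compare-and-swap on the snapshotted version:

```python
import sqlite3

# Illustrative schema, not necessarily MLflow's actual trace_info table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trace_info (trace_id TEXT PRIMARY KEY, "
    "spans_location TEXT, trace_version INTEGER DEFAULT 0)"
)

def try_finalize_archival(conn, trace_id, snapshot_version):
    # Finalize only if no DB-backed write advanced the version since the
    # spans were snapshotted, and the trace is still DB-backed.
    cur = conn.execute(
        "UPDATE trace_info SET spans_location = 'ARCHIVE_REPO' "
        "WHERE trace_id = ? AND trace_version = ? "
        "AND spans_location = 'TRACKING_STORE'",
        (trace_id, snapshot_version),
    )
    conn.commit()
    return cur.rowcount == 1  # False: archival lost the race; retry later
```

If the compare-and-swap matches zero rows, the trace simply stays in `TRACKING_STORE` and is picked up again on a later archival pass, which also covers the crash-recovery case.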

What happens to trace_version after a trace is archived? If the trace is later restored back to TRACKING_STORE, does it pick up from the old version or reset?

Contributor Author

@HumairAK HumairAK Apr 29, 2026

Good question. Restore is future work and out of scope here, so trace_version semantics after restore are intentionally unspecified (see the out-of-scope section).

But for future work, my preference would be to reset the version rather than advance it.

Document version-based coordination between archival finalization and concurrent DB-backed writes so traces cannot end up with mixed DB and archive payload state.
@HumairAK HumairAK force-pushed the trace_archival_coordination branch from 01df0db to aca54cd on April 29, 2026 18:46
Traces stay in `TRACKING_STORE` until archival completes (export to the archival repository + clear DB content + set tag to `ARCHIVE_REPO`). If the process crashes mid-archival, the trace remains `TRACKING_STORE` and will be selected again on the next run (re-export overwrites the file, then clear and set tag).
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.

Add a `trace_version` column on `trace_info` (or an equivalent per-trace generation field) for this purpose, logically defaulting to `0`. This version is MLflow-internal coordination state for DB-backed traces and is not part of the archived OTLP `TracesData` payload. It tracks the generation of the DB-backed span payload, and any DB-backed trace state derived from `log_spans()` writes, while the trace is in `TRACKING_STORE`; unrelated metadata changes and archival's own `SpansLocation` transition do not need to advance it.
Contributor

Can we add this as part of the existing trace metadata key-value map? I don't feel that a simple version field, with no heavy query requirements, justifies a new top-level column and a schema migration.

Contributor Author

@HumairAK HumairAK May 3, 2026

Hey @B-Step62, using trace metadata to resolve the trace write coordination was my initial approach. The reasons why I decided to go a different route are addressed here.

Contributor

Thanks, I also thought about it again and am convinced that the new integer column is the simpler approach. Using metadata has several other headaches, e.g. string casting and handling upserts.

Traces stay in `TRACKING_STORE` until archival completes (export to the archival repository + clear DB content + set tag to `ARCHIVE_REPO`). If the process crashes mid-archival, the trace remains `TRACKING_STORE` and will be selected again on the next run (re-export overwrites the file, then clear and set tag).
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.

Add a `trace_version` column on `trace_info` (or an equivalent per-trace generation field) for this purpose, logically defaulting to `0`. This version is MLflow-internal coordination state for DB-backed traces and is not part of the archived OTLP `TracesData` payload. It tracks the generation of the DB-backed span payload, and any DB-backed trace state derived from `log_spans()` writes, while the trace is in `TRACKING_STORE`; unrelated metadata changes and archival's own `SpansLocation` transition do not need to advance it.
Contributor

The name trace_version is too broad; we already have metadata that represents the trace schema version. Can we use a different, self-explanatory name?

Contributor Author

I agree, trace_version doesn't exactly convey that this is an internal payload generation used to assess read freshness. How about something narrower like db_payload_generation?

Contributor

Yea db_payload_generation works better!

- **`ARTIFACT_REPO`** (existing): Retained for existing behavior (e.g. V3 API traces whose spans are stored in the artifact store). Distinct from `ARCHIVE_REPO` for the proposed OTLP archival repository.

Traces stay in `TRACKING_STORE` until archival completes (export to the archival repository + clear DB content + set tag to `ARCHIVE_REPO`). If the process crashes mid-archival, the trace remains `TRACKING_STORE` and will be selected again on the next run (re-export overwrites the file, then clear and set tag).
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.
Contributor

@HumairAK Have we considered alternatives to this solution? I feel the version-based check is a bit over-complex for the size of the problem we're trying to avoid. In practice, it will be rare that users need to archive traces within a very short window after they are written, since the archival job has a scheduling buffer anyway. One option is just to mandate a buffer on the archival trace timestamp, e.g. users can only archive traces older than 5 minutes, or to document that state consistency when archiving an in-flight trace is best-effort.

Contributor Author

@HumairAK HumairAK May 3, 2026

Hey @B-Step62, I had the same concerns and had initially gone with a best-effort approach as well (this is what is currently in the feature branch).

I considered other, simpler options, but the main issue here is not queryability; it is having a cheap internal freshness token during the archive_trace job, which spans a DB read, an archive upload, and a finalization DB write.

In the current DB schema, the "existing metadata map" is a separate key-value table, so storing the version there would make the coordination path more complex, not less. And since tags are user-facing, it would also expose what should otherwise be an internal tracking implementation detail.

We also previously tried a transient archiving barrier using trace tags on the writer side, but that added steady-state overhead to log_spans() and extra stale-marker handling.

A minimum-age buffer can reduce race frequency, but it does not provide the same correctness guarantee. Note that log_spans() currently allows spans to be added even after a trace has reached a terminal state, so an age buffer would be insufficient in such scenarios. The same applies to cases like archive_now = {}, which intentionally bypass age-based assumptions.

Yes, the per-trace generation is mainly about making archival finalization safe. But during implementation it also helped in a secondary way, by making start_trace() participate in the same storage-consistency protocol as other DB-backed writes: if log_spans() created the DB-backed trace and start_trace() later fills in the authoritative trace info, that update now advances the generation and causes stale archival snapshots to lose. In that sense, the internal generation token is not as narrow as it may seem; it also provides a coordination mechanism that future DB-backed writers can participate in.

start_trace() and log_spans() already have to coordinate concurrent writes to the same trace, and archival introduces a third path where races can occur. So in my opinion, this is not an issue just isolated to trace archival. It is a case where the trace write path is becoming more concurrent overall, and having an explicit internal coordination protocol starts becoming worth it.
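The writer-side rule could be sketched like this (a hedged illustration with hypothetical table names, not MLflow's actual schema): any payload-changing DB-backed writer, whether log_spans() or start_trace(), advances the generation in the same transaction and is rejected once the trace is no longer DB-backed:

```python
import sqlite3

# Illustrative schema for the sketch only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trace_info (trace_id TEXT PRIMARY KEY, "
    "spans_location TEXT, trace_version INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO trace_info VALUES ('t1', 'TRACKING_STORE', 0)")

def advance_generation(conn, trace_id):
    # A payload-changing write succeeds only while the trace is still
    # DB-backed; the UPDATE both gates the write and bumps the version
    # that archival finalization later compares against.
    cur = conn.execute(
        "UPDATE trace_info SET trace_version = trace_version + 1 "
        "WHERE trace_id = ? AND spans_location = 'TRACKING_STORE'",
        (trace_id,),
    )
    if cur.rowcount != 1:
        raise RuntimeError("trace is no longer DB-backed; rejecting write")
    conn.commit()
```

Because every such write bumps the version, any archival pass whose snapshot predates the write fails its finalization compare-and-swap and retries, regardless of which writer caused the change.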

Contributor

Thank you for the detailed description! You're right, if the solution scopes the general race between start_trace and log_spans beyond an edge case in archival, the complexity is justified.

Contributor

@B-Step62 B-Step62 left a comment
