Update RFC-0001 for archival locking and archival barrier semantics#9

Open
HumairAK wants to merge 2 commits into mlflow:main from HumairAK:trace_archival_coordination
Conversation

Contributor

@HumairAK HumairAK commented Apr 22, 2026

Summary

This PR updates RFC-0001 for trace archival to reflect the current design for both archival payload location semantics and archival/write coordination.

In particular, it:

  • keeps mlflow.artifactLocation reserved for the trace's ordinary artifact root, including attachments
  • introduces mlflow.trace.archiveLocation as the archive-specific location for archived traces.pb payloads
  • updates retrieval, deletion, and repository-ingestion semantics to consistently use the archive-specific location when SpansLocation=ARCHIVE_REPO
  • clarifies the existing artifact-backed trace path so the RFC distinguishes ordinary trace artifacts from archival payload storage
  • replaces the earlier transient archival-barrier approach with version-based coordination using a per-trace trace_version
  • documents atomic archival finalization semantics so concurrent DB-backed writes and archival cannot leave a trace in a mixed DB/archive state
  • clarifies crash recovery and retry behavior when archival is interrupted or loses a race with a concurrent write
  • updates the implementation-impact and migration notes to reflect the new archive-location tag and trace_version coordination model

Implementation context

Related implementation details:

7. **Finalize archival atomically:** Update `spans.content` to an empty string, set the trace tag `mlflow.trace.spansLocation` = `SpansLocation.ARCHIVE_REPO`, and clear the transient `mlflow.trace.archiving` tag.
8. **Clear the emergency tag:** If the archival was triggered by `archive now` and the emergency archival pass completes successfully, clear the tag so it behaves as a one-shot failsafe.

`log_spans()` must cooperate with this barrier. In particular, DB-backed span writes should check for blocked traces both before and after their batch upserts so a write that races with archival fails and rolls back instead of partially succeeding and then reasserting `TRACKING_STORE` after archival has begun.
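A minimal sketch of this double-check pattern, using an in-memory SQLite database. The table and column names (`trace_tags`, `spans`) are illustrative only, not MLflow's actual schema; the real store would run this inside its own transaction machinery:

```python
import sqlite3

# Hypothetical minimal schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace_tags (trace_id TEXT, key TEXT, value TEXT)")
conn.execute("CREATE TABLE spans (trace_id TEXT, span_id TEXT, content TEXT)")

def trace_blocked(conn, trace_id):
    # A trace is "blocked" while the transient archiving tag is present.
    row = conn.execute(
        "SELECT 1 FROM trace_tags WHERE trace_id = ? AND key = 'mlflow.trace.archiving'",
        (trace_id,),
    ).fetchone()
    return row is not None

def log_spans(conn, trace_id, spans):
    # Check once before the batch upsert...
    if trace_blocked(conn, trace_id):
        raise RuntimeError("trace is being archived; write rejected")
    conn.executemany(
        "INSERT INTO spans (trace_id, span_id, content) VALUES (?, ?, ?)",
        [(trace_id, s, c) for s, c in spans],
    )
    # ...and again after, so a write that raced with archival rolls back
    # instead of partially succeeding.
    if trace_blocked(conn, trace_id):
        conn.rollback()
        raise RuntimeError("archival started mid-write; rolled back")
    conn.commit()
```

A write that begins before the archiving tag is set but finishes after it is set fails the second check and rolls back, which is the case the single pre-check alone cannot catch.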
Collaborator

Could this be optimized so that the span insert fails by using a where clause on the archive in progress tag not being set? This would avoid the additional read.

Contributor Author

Unfortunately this wouldn't be enough to coordinate the writes. In the current SQL-backed implementation, we have to coordinate archival and log_spans() more broadly than just making the insert conditional. Once archival takes ownership of a trace, conflicting DB-backed writes are rejected so they can’t partially succeed and then reassert TRACKING_STORE.

The relevant code can be found here.

Collaborator

mprahl commented Apr 23, 2026

@HumairAK I'd suggest that future RFC updates should be less implementation detail heavy and be focused more on the decisions to be made.

You could have avoided my comment 😆 : #9 (comment)

Contributor Author

I agree. My original intention was just to draw attention to the fact that we'll need to add coordination to the existing log_spans(), but I think it went a bit too far into the weeds. I've updated the doc to focus more on the semantics of coordination; see the last commit.

@HumairAK HumairAK requested a review from mprahl April 24, 2026 13:25
@HumairAK HumairAK force-pushed the trace_archival_coordination branch from 598d41e to fa230c4 on April 24, 2026 20:45
Comment thread rfcs/0001-trace-archival/0001-trace-archival.md Outdated
```python
from alembic import op
import sqlalchemy as sa


def upgrade():
    with op.batch_alter_table("trace_info", schema=None) as batch_op:
        batch_op.add_column(sa.Column("trace_version", sa.BigInteger(), nullable=True))
```
Collaborator

This needs to be updated after the latest commit, right?

@HumairAK HumairAK force-pushed the trace_archival_coordination branch from bd35d41 to f5e6b65 on April 24, 2026 21:19
Collaborator

mprahl commented Apr 24, 2026

@B-Step62 could you please take a look at this?

Keep trace artifact roots and archival payload locations distinct so archived traces do not interfere with existing trace artifacts or attachments.
Comment thread rfcs/0001-trace-archival/0001-trace-archival.md Outdated
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.

The archival repository URI for archived traces (where to read `traces.pb`) continues to be recorded via the existing `mlflow.artifactLocation` tag. The `mlflow.trace.spansLocation` tag disambiguates how that URI is interpreted: when span data is in `ARTIFACT_REPO`, it points to the existing artifact-backed trace path; when span data is in `ARCHIVE_REPO`, it points to the archived file location in the archival repository for `traces.pb`. The **effective archival repository root** for a given trace is the workspace's `trace_archival_location` when set, otherwise the server's global `--trace-archival-location` (or default artifact root).
Add a `trace_version` column on `trace_info` (or an equivalent per-trace generation field) for this purpose, logically defaulting to `0`. This version is MLflow-internal coordination state for DB-backed traces and is not part of the archived OTLP `TracesData` payload. It tracks the generation of the DB-backed span payload, and any DB-backed trace state derived from `log_spans()` writes, while the trace is in `TRACKING_STORE`; unrelated metadata changes and archival's own `SpansLocation` transition do not need to advance it.
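The select-snapshot-finalize coordination described above could be sketched roughly as follows. This is a hedged illustration with an in-memory SQLite database; table and column names are hypothetical, and clearing span content is omitted. The key idea is that finalization is a single conditional UPDATE acting as a compare-and-swap on the snapshotted version:

```python
import sqlite3

# Illustrative schema, not necessarily MLflow's actual trace_info table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trace_info (trace_id TEXT PRIMARY KEY, "
    "spans_location TEXT, trace_version INTEGER DEFAULT 0)"
)

def try_finalize_archival(conn, trace_id, snapshot_version):
    # Finalize only if no DB-backed write advanced the version since the
    # spans were snapshotted, and the trace is still DB-backed.
    cur = conn.execute(
        "UPDATE trace_info SET spans_location = 'ARCHIVE_REPO' "
        "WHERE trace_id = ? AND trace_version = ? "
        "AND spans_location = 'TRACKING_STORE'",
        (trace_id, snapshot_version),
    )
    conn.commit()
    return cur.rowcount == 1  # False: archival lost the race; retry later
```

If the compare-and-swap matches zero rows, the trace simply stays in `TRACKING_STORE` and is picked up again on a later archival pass, which also covers the crash-recovery case.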

What happens to trace_version after a trace is archived? If the trace is later restored back to TRACKING_STORE, does it pick up from the old version or reset?

Contributor Author

@HumairAK HumairAK Apr 29, 2026

Good question. Restore is future work and out of scope here, so trace_version semantics after restore are intentionally unspecified (see the out-of-scope section).

But for future work, my preference would be to reset the version rather than advance it.

Document version-based coordination between archival finalization and concurrent DB-backed writes so traces cannot end up with mixed DB and archive payload state.
@HumairAK HumairAK force-pushed the trace_archival_coordination branch from 01df0db to aca54cd on April 29, 2026 18:46
Traces stay in `TRACKING_STORE` until archival completes (export to the archival repository + clear DB content + set tag to `ARCHIVE_REPO`). If the process crashes mid-archival, the trace remains `TRACKING_STORE` and will be selected again on the next run (re-export overwrites the file, then clear and set tag).
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.

Add a `trace_version` column on `trace_info` (or an equivalent per-trace generation field) for this purpose, logically defaulting to `0`. This version is MLflow-internal coordination state for DB-backed traces and is not part of the archived OTLP `TracesData` payload. It tracks the generation of the DB-backed span payload, and any DB-backed trace state derived from `log_spans()` writes, while the trace is in `TRACKING_STORE`; unrelated metadata changes and archival's own `SpansLocation` transition do not need to advance it.
Contributor

Can we add this as part of the existing trace metadata key-value map? I don't feel that a simple version field, with no heavy query requirements, justifies a new top-level column and a schema migration.

Contributor Author

@HumairAK HumairAK May 3, 2026

Hey @B-Step62, using trace metadata to resolve the trace write coordination was my initial approach. The reasons why I decided to go a different route are addressed here.

Contributor

Thanks, I also thought about it again and am convinced that the new integer column is the simpler approach. Using metadata has several other headaches, e.g. string casting and handling upserts.

Traces stay in `TRACKING_STORE` until archival completes (export to the archival repository + clear DB content + set tag to `ARCHIVE_REPO`). If the process crashes mid-archival, the trace remains `TRACKING_STORE` and will be selected again on the next run (re-export overwrites the file, then clear and set tag).
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.

Add a `trace_version` column on `trace_info` (or an equivalent per-trace generation field) for this purpose, logically defaulting to `0`. This version is MLflow-internal coordination state for DB-backed traces and is not part of the archived OTLP `TracesData` payload. It tracks the generation of the DB-backed span payload, and any DB-backed trace state derived from `log_spans()` writes, while the trace is in `TRACKING_STORE`; unrelated metadata changes and archival's own `SpansLocation` transition do not need to advance it.
Contributor

The name trace_version is too broad; we already have metadata that represents the trace schema version. Can we use a different, self-explanatory name?

Contributor Author

I agree, trace_version doesn't exactly convey that this is an internal payload generation used to assess read freshness. How about something narrower like db_payload_generation?

Contributor

Yea db_payload_generation works better!

- **`ARTIFACT_REPO`** (existing): Retained for existing behavior (e.g. V3 API traces whose spans are stored in the artifact store). Distinct from `ARCHIVE_REPO` for the proposed OTLP archival repository.

Traces stay in `TRACKING_STORE` until archival completes (export to the archival repository + clear DB content + set tag to `ARCHIVE_REPO`). If the process crashes mid-archival, the trace remains `TRACKING_STORE` and will be selected again on the next run (re-export overwrites the file, then clear and set tag).
**Use a per-trace DB-backed payload version.** This coordination is needed because archival is no longer just a single final state flip. The server must first select the trace and snapshot its current DB-backed payload version, then read the DB-backed spans it intends to archive, then upload `traces.pb`, and only then decide whether it may finalize. Each successful DB-backed write that changes the DB-backed trace payload must atomically advance this version so archival finalization can detect whether the payload changed after it was selected. A DB-backed write must only succeed while the trace still remains DB-backed.
Contributor

@HumairAK Have we considered alternatives to this solution? I feel the version-based check is a bit over-complex for the size of the problem we're trying to avoid. In practice, it will be rare that users need to archive traces within a very short window after they are written, since the archival job has a scheduling buffer anyway. One option is just to mandate a buffer on the archival trace timestamp, e.g. users can only archive traces older than 5 minutes, or to document that state consistency when archiving an in-flight trace is best-effort.

Contributor Author

@HumairAK HumairAK May 3, 2026

Hey @B-Step62, I had the same concerns and had initially gone with a best-effort approach as well (this is what is currently in the feature branch).

I considered other, simpler options, but the main issue here is not queryability; it is having a cheap internal freshness token during the archive_trace job, which spans a DB read, an archive upload, and a finalization DB write.

In the current DB schema, the "existing metadata map" is a separate key-value table, so storing the version there would make the coordination path more complex, not less. And since tags are user-facing, it would also expose what should otherwise be an internal tracking implementation detail.

We also previously tried a transient archiving barrier using trace tags on the writer side, but that added steady-state overhead to log_spans() and extra stale-marker handling.

A minimum-age buffer can reduce race frequency, but it does not provide the same correctness guarantee. Note that log_spans() currently allows spans to be added even after a trace has reached a terminal state, so an age buffer would be insufficient in such scenarios. The same applies to cases like archive_now = {}, which intentionally bypass age-based assumptions.

Yes, the per-trace generation is mainly about making archival finalization safe. But during implementation it also helped in a secondary way, by making start_trace() participate in the same storage-consistency protocol as other DB-backed writes: if log_spans() created the DB-backed trace and start_trace() later fills in the authoritative trace info, that update now advances the generation and causes stale archival snapshots to lose. In that sense, the internal generation token is not as narrow as it may seem; it also provides a coordination mechanism that future DB-backed writers can participate in.

start_trace() and log_spans() already have to coordinate concurrent writes to the same trace, and archival introduces a third path where races can occur. So in my opinion, this is not an issue just isolated to trace archival. It is a case where the trace write path is becoming more concurrent overall, and having an explicit internal coordination protocol starts becoming worth it.
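The writer-side rule could be sketched like this (a hedged illustration with hypothetical table names, not MLflow's actual schema): any payload-changing DB-backed writer, whether log_spans() or start_trace(), advances the generation in the same transaction and is rejected once the trace is no longer DB-backed:

```python
import sqlite3

# Illustrative schema for the sketch only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trace_info (trace_id TEXT PRIMARY KEY, "
    "spans_location TEXT, trace_version INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO trace_info VALUES ('t1', 'TRACKING_STORE', 0)")

def advance_generation(conn, trace_id):
    # A payload-changing write succeeds only while the trace is still
    # DB-backed; the UPDATE both gates the write and bumps the version
    # that archival finalization later compares against.
    cur = conn.execute(
        "UPDATE trace_info SET trace_version = trace_version + 1 "
        "WHERE trace_id = ? AND spans_location = 'TRACKING_STORE'",
        (trace_id,),
    )
    if cur.rowcount != 1:
        raise RuntimeError("trace is no longer DB-backed; rejecting write")
    conn.commit()
```

Because every such write bumps the version, any archival pass whose snapshot predates the write fails its finalization compare-and-swap and retries, regardless of which writer caused the change.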

Contributor

Thank you for the detailed description! You're right, if the solution scopes the general race between start_trace and log_spans beyond an edge case in archival, the complexity is justified.

Contributor

@B-Step62 B-Step62 left a comment
