TRACE: Translator Representation Analysis of Ceilings and Efficiency

Preprint: Saisan & Patel (2026). Molecular Translators as a Computational Primitive for Biomarker Discovery: Learnability Gains Under Conserved Information Ceilings. bioRxiv. https://doi.org/10.64898/2026.04.27.720188

If you use TRACE in your work, please cite the preprint above — BibTeX at the bottom.



High-fidelity molecular translator systems — tools that transform routine H&E slides into virtual molecular maps — are emerging as powerful primitives for biomarker modeling and discovery, particularly as a way to engage the field's recurring prediction plateau.

Their apparent success, however, invites a consequential misconception. As the line between virtual and measured molecular maps begins to blur, high virtual fidelity is easily mistaken for newly recovered molecular information.

That assumption is wrong.

A molecular translator is a deterministic map of morphology. However faithful, it cannot introduce new slide-specific information at inference. The H&E deployment ceiling is conserved.

That is not a negative result.

Translators still carry a potentially transformative advantage — as force multipliers on biomarker prediction learning. But translator-driven learnability gains arise in the absence of new information, producing an information–performance paradox whose resolution has real consequences for how the field moves forward and invests its resources.

Without formal scaffolding, these gains will be routinely conflated with added information — misdirecting developmental effort in precision medicine where molecularly actionable decisions depend on predictors learned from pathology. The field risks exhausting resources against a ceiling it cannot see.


The underlying problem is not specific to pathology. Wherever a computational surrogate stands in for a richer or more costly measurement at deployment — virtual staining, spatial transcriptomics imputation, single-cell modality transfer, remote sensing proxies — the same paradox applies: apparent performance gains arising not from new information but from better-organized existing information. TRACE was designed to be agnostic to domain; the ceiling–gap decomposition and ARC diagnostic apply to any paired-data-trained proxy deployed in place of a direct measurement.


Why TRACE, and Why Now?

TRACE arrives as translated molecular intermediates are becoming a potentially game-changing computational primitive for biomarker modeling and discovery. It is built for studying this setting: when a downstream target $Y$ may be better predicted indirectly, through a translated representation $h(X)$ derived from $X$, than directly from the original deployable representation $X$.

Many biomarker pipelines now follow the pattern:

deployable representation X  →  translator  →  translated representation h(X)
| Deployable representation $X$ | Translated representation $h(X)$ |
| --- | --- |
| H&E slide embeddings | predicted gene expression |
| H&E embeddings | predicted proteomics |
| H&E embeddings | predicted immune signatures |
| morphology features | predicted pathway activity |
| pathology foundation-model tokens | cross-modal biological embeddings |

The systems that anchor TRACE's empirical validation are MISO (Nat. Commun. 2025), which translates H&E to spatial gene expression, and GigaTIME (Cell 2025), which translates H&E to virtual multiplex immunofluorescence. GigaTIME alone was applied to 14,256 patients across 51 hospitals — a scale of virtual molecular mapping previously infeasible without translation. A Microsoft Research blog post and interactive demo provide accessible entry points to GigaTIME.

These are exactly the systems whose gains TRACE is designed to interpret correctly.


What TRACE Does

TRACE provides the mathematical framework and computational tools to resolve this paradox operationally.

Ceiling–gap decomposition formally separates the Bayes-optimal information ceiling (modality-limited, irreducible) from finite-sample method gaps (recoverable through better supervision, more labels, or improved architectures). Because a deployed translator is a deterministic function of morphology, its output $\hat{Z} = h(X)$ obeys the data processing inequality:

$$I(Y;\hat{Z}) \leq I(Y;X)$$

and accordingly

$$\mathrm{AUC}^{*}(\hat{Z}) \leq \mathrm{AUC}^{*}(X)$$

Any practical gain from $h(X)$ over $X$ must therefore be interpreted as a learnability effect, not as recovery of new deployment-time signal.
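The inequality can be checked numerically on a toy discrete example. The sketch below is not part of TRACE; the joint distribution, the two maps, and the helper names `mutual_information` and `push_forward` are all illustrative assumptions. It computes $I(Y;X)$ and $I(Y;h(X))$ for a lossy deterministic map and for an invertible relabeling.

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

def push_forward(joint_xy, h):
    """Joint distribution of (h(X), Y) induced by a deterministic map h."""
    out = defaultdict(float)
    for (x, y), p in joint_xy.items():
        out[(h(x), y)] += p
    return dict(out)

# Toy joint distribution over X in {0, 1, 2, 3} and Y in {0, 1} (made-up values).
joint_xy = {
    (0, 0): 0.20, (0, 1): 0.05,
    (1, 0): 0.10, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.20,
    (3, 0): 0.15, (3, 1): 0.10,
}

i_x = mutual_information(joint_xy)
# A lossy "translator" that merges pairs of X states.
i_lossy = mutual_information(push_forward(joint_xy, lambda x: x // 2))
# An invertible "translator" that only relabels X states.
i_relabel = mutual_information(push_forward(joint_xy, lambda x: 3 - x))

print(f"I(Y;X)           = {i_x:.4f} bits")
print(f"I(Y;h(X)), lossy = {i_lossy:.4f} bits")
print(f"I(Y;h(X)), 1-to-1 = {i_relabel:.4f} bits")
```

The lossy map can only lower the mutual information, and the invertible map leaves it unchanged. In neither case does the translated representation exceed the ceiling set by $X$.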

Falsifiable signatures distinguish method-limited from modality-limited regimes, validated in controlled analytical experiments anchored to MISO and GigaTIME.

The Advantage Representation Curve (ARC) is the toolkit's primary diagnostic.

TRACE computes paired learning curves

$$A_X(n) \equiv \mathrm{AUC}_n(X), \qquad A_H(n) \equiv \mathrm{AUC}_n(h(X))$$

and their difference

$$\mathrm{ARC}(n) = A_H(n) - A_X(n)$$

indexed by label budget $n$. The shape of ARC across $n$ is typically far more informative than any single-endpoint benchmark.
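As a minimal sketch of the definition above, assuming per-repeat AUC values have already been computed on the same subsamples for both representations, ARC and a simple standard error can be obtained from paired differences. The function name `arc_with_uncertainty`, the input layout, and the numbers are hypothetical and do not describe TRACE's own API.

```python
from math import sqrt
from statistics import mean, stdev

def arc_with_uncertainty(auc_x, auc_h):
    """Paired ARC per label budget.

    auc_x, auc_h: {n: [AUC per repeat]}, where repeat k used the same
    subsample for X and h(X). Returns {n: (mean ARC, standard error)}.
    """
    out = {}
    for n in sorted(auc_x):
        diffs = [h - x for x, h in zip(auc_x[n], auc_h[n])]
        # The standard error needs at least two repeats.
        se = stdev(diffs) / sqrt(len(diffs)) if len(diffs) > 1 else float("nan")
        out[n] = (mean(diffs), se)
    return out

# Made-up AUC values for two label budgets, three repeats each.
auc_x = {50: [0.60, 0.62, 0.58], 500: [0.74, 0.75, 0.76]}
auc_h = {50: [0.70, 0.71, 0.69], 500: [0.75, 0.75, 0.77]}

for n, (arc, se) in arc_with_uncertainty(auc_x, auc_h).items():
    print(f"n={n}: ARC={arc:+.3f} (SE {se:.3f})")
```

Differencing within each repeat, rather than differencing two independently averaged curves, removes the subsample-to-subsample variation that both representations share.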

Decision support — TRACE turns ARC geometry into actionable interpretation:

| ARC pattern | Interpretation | Suggested action |
| --- | --- | --- |
| Positive at small $n$, decaying toward zero | Translation improves sample efficiency | Collect more labels if feasible |
| Positive across the studied range | Translated representation retains practical value | Improve or exploit the translator |
| Near zero throughout | Translation adds little downstream value | Prefer direct learning on $X$ |
| Negative or unfavorably sign-reversing | Translation is lossy or distorting | Avoid translated representation for this task |

These are empirical reference regimes, not rigid bins. Real studies may fall between them.
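To make the reference patterns concrete, here is a deliberately crude threshold heuristic that maps an ARC curve to the closest description. It is an illustration only: TRACE reports regime scores rather than hard labels, and the function name `describe_arc` and the tolerance `tol` are assumptions introduced here.

```python
def describe_arc(arc, tol=0.01):
    """Label an ARC curve (values ordered by increasing n) with a reference pattern.

    tol is a hypothetical "near zero" band in AUC units; choose it from the
    uncertainty of your own curves.
    """
    if all(abs(a) <= tol for a in arc):
        return "near zero throughout"
    if all(a > tol for a in arc):
        return "positive across the studied range"
    if arc[0] > tol and abs(arc[-1]) <= tol and min(arc) >= -tol:
        return "positive at small n, decaying toward zero"
    if min(arc) < -tol:
        return "negative or sign-reversing"
    # Real studies may fall between the reference regimes.
    return "between reference regimes"

print(describe_arc([0.10, 0.05, 0.00]))
print(describe_arc([0.08, 0.07, 0.06]))
print(describe_arc([0.00, 0.005, -0.005]))
print(describe_arc([-0.05, -0.02, 0.01]))
```

A hard rule like this discards the uncertainty in each ARC value, so it is best treated as a reading aid for the table rather than a decision procedure.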


Installation

git clone https://github.com/psaisan/TRACE
cd TRACE
pip install -e .

Note: Python's standard library includes a module named trace. Run notebooks and scripts from the repository root to avoid import collisions.
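If an import behaves unexpectedly, a quick check of which file a module name resolves to can help. This sketch uses only the standard library; `module_origin` is a hypothetical helper, and it makes no assumption about the name under which TRACE itself is importable.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module name resolves to, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# Shows whether "trace" currently resolves to the standard library or to a
# file or directory on the current import path.
print(module_origin("trace"))
```

Running it from the repository root and from another directory shows whether the working directory changes what the name resolves to.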


Getting Started

The recommended entry point is the notebook sequence in Notebooks/:

  1. 00_overview_and_reference_scenarios.ipynb — TRACE outputs and reference regimes
  2. 01_sample_efficiency.ipynb — positive low-label ARC, decaying toward zero
  3. 02_persistent_advantage.ipynb — sustained positive ARC across label budgets
  4. 03_no_advantage.ipynb — near-null ARC
  5. 04_lossy_translation.ipynb — negative ARC / failure regime
  6. 05_custom_scenario_playground.ipynb — user-defined scenarios

For paper-style synthetic outputs, start with notebook 00.


Practical Workflow

A typical applied setting:

X = H&E embeddings
H = translated features (e.g. predicted gene expression from an H&E→RNA translator)
Y = downstream label (e.g. mutation status, response class)

Provide X, H, and y; specify a label-budget grid n_grid; fit and evaluate paired downstream models across repeated subsamples. TRACE returns three coordinated outputs for each run: paired learning curves with uncertainty bands, the ARC curve, and a regime-score panel quantifying similarity to the four canonical regimes.
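The workflow above can be sketched end to end in plain Python. This is not TRACE's implementation or API: the synthetic data, the toy mean-difference scorer, and the names `auc`, `fit_mean_difference`, and `paired_learning_curves` are assumptions made for illustration. Here `H` is a deterministic function of `X`, as a deployed translator would be.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_difference(rows, labels):
    """Toy linear scorer whose weights are the per-feature class-mean differences."""
    d = len(rows[0])
    mu = {c: [mean(r[j] for r, y in zip(rows, labels) if y == c) for j in range(d)]
          for c in (0, 1)}
    w = [mu[1][j] - mu[0][j] for j in range(d)]
    return lambda r: sum(wj * rj for wj, rj in zip(w, r))

def paired_learning_curves(X, H, y, n_grid, n_repeats=5, seed=0):
    """Fit on the same class-balanced subsample of size n for X and H; test on the rest."""
    rng = random.Random(seed)
    by_class = {c: [i for i, yi in enumerate(y) if yi == c] for c in (0, 1)}
    curves = {}
    for n in n_grid:
        auc_x, auc_h = [], []
        for _ in range(n_repeats):
            train = rng.sample(by_class[0], n // 2) + rng.sample(by_class[1], n // 2)
            held_in = set(train)
            test = [i for i in range(len(y)) if i not in held_in]
            for feats, out in ((X, auc_x), (H, auc_h)):
                score = fit_mean_difference([feats[i] for i in train],
                                            [y[i] for i in train])
                out.append(auc([score(feats[i]) for i in test],
                               [y[i] for i in test]))
        a_x, a_h = mean(auc_x), mean(auc_h)
        curves[n] = {"A_X": a_x, "A_H": a_h, "ARC": a_h - a_x}
    return curves

# Synthetic data: 20 features, only the first two carry signal about y.
data_rng = random.Random(1)
y = [i % 2 for i in range(400)]
X = [[(0.8 if yi else -0.8) * (j < 2) + data_rng.gauss(0, 1) for j in range(20)]
     for yi in y]
H = [[row[0] + row[1]] for row in X]  # deterministic function of X

curves = paired_learning_curves(X, H, y, n_grid=[10, 40, 160])
for n, c in curves.items():
    print(f"n={n}: A_X={c['A_X']:.3f}  A_H={c['A_H']:.3f}  ARC={c['ARC']:+.3f}")
```

Because `H` is computed from `X`, any advantage it shows at small `n` reflects easier learning from a compact representation, not additional information about `y`.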

TRACE reference outputs across four canonical ARC regimes

Figure 11 from the accompanying manuscript. Each row shows paired learning curves (left), the ARC (middle), and regime scores (right) for the four canonical regimes: quick-gain, sustained-gain, neutral, and impaired.

A fully executable demonstration that reproduces all Figure 11 panels with a single command, trace_reviewer_demo.py, will be added to the repository root once manuscript review is complete. Reviewers may request it in the meantime.


Citation

@article{saisan2026moleculartranslators,
  author  = {Saisan, Payam A. and Patel, Sandip Pravin},
  title   = {Molecular Translators as a Computational Primitive for Biomarker
             Discovery: Learnability Gains Under Conserved Information Ceilings},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.04.27.720188},
  url     = {https://doi.org/10.64898/2026.04.27.720188}
}

License

MIT
