Deeply Expanded Capabilities (CUDA + MLX, tested with Blackwell and FA4 + Apple M5) #184

Entrpi · 2026-03-10T01:16:03Z

I love this. I was working on something similar when the project dropped, and I've since pivoted to extending it into a more platform-extensible version. I started with a refined MLX port and have focused primarily on optimizations that will make it easier to add further GPU training backends.

I have also integrated a notion called "Autonomy Golf", which is how I've been driving the development cycle and which you can apply to any project you want to automate further. It is both a scoring system for how fully automated your project is and a process your agent can adopt to gain better insight into the development cycle.

https://github.com/Entrpi/autoresearch-everywhere

Entrpi · 2026-03-11T04:39:27Z

Quick update on where this has gone since I opened this.

The biggest shift is that this is no longer just “a refined MLX port.” It has turned into a measured platform-calibration stack that starts on MLX/Metal, but is increasingly organized around a more general problem:

if someone clones autoresearch on a new machine, what model size, sequence length, batch structure, and eval policy should they start with?
how does the system figure that out from measurements instead of hand-tuning?
how does it know when old measurements are no longer safe to trust?
how does a new machine discover a good autoresearch operating point?
how does the runtime know when old calibration is still trustworthy?
how do we promote new defaults from measured evidence instead of hand-tuning each platform?

A few changes were especially important.

1. Eval efficiency became a primary focus

On local MLX hardware, eval became much heavier. It could dominate the wall clock of the whole autoresearch loop, so it directly affects:

how many experiments you can run
how honest your keep/discard decisions are
whether a local port is actually usable as a research platform

The important part is that I did not just make eval cheaper. I grounded it against a full upstream-shaped target and then measured the best local tradeoff against that target.

Concretely:

I kept the long-context semantic target at seq_len=2048
I remeasured the full upstream-style eval locally and found that on the M5 reference machine the right batch is 2, not 256
that reduced full eval from about 287.7s to about 183.8s with identical BPB, so the local reference itself got much better without changing the metric

From there I built a rung ladder (seq_len / eval tokens / batch size):

cheap: 2048 / 262144 / batch 2
reference: 2048 / 1572864 / batch 2
full: 2048 / 20971520 / batch 2

And I measured the error/time tradeoff against the full upstream-style baseline instead of guessing:

cheap: about 2.96s, abs error about 0.002189
reference: about 17.34s, abs error about 0.000503
full: about 183.79s, zero error by definition

That is the real reason eval efficiency became such a major focus: the goal was to find the best grounded local proxy for the full upstream metric, not to invent a different easier metric. The runtime can now choose between cheap, reference, and full based on measured overhead, so short local runs stay fast while longer runs can afford to climb toward the full target.
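
A minimal sketch of how a runtime could pick a rung from measured overhead, using the timings and errors quoted above. The `choose_rung` name, the `EvalRung` type, and the 10% overhead budget are illustrative assumptions rather than the repo's actual policy, and the middle number of each rung is assumed to be the eval token count.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalRung:
    name: str
    eval_tokens: int      # assumed meaning of the middle number in each rung
    seconds: float        # measured eval wall time quoted in the post
    abs_bpb_error: float  # measured absolute BPB error vs. the full rung


# Measurements quoted in the post (M5 reference machine, seq_len=2048, batch 2).
RUNGS: List[EvalRung] = [
    EvalRung("cheap", 262_144, 2.96, 0.002189),
    EvalRung("reference", 1_572_864, 17.34, 0.000503),
    EvalRung("full", 20_971_520, 183.79, 0.0),
]


def choose_rung(train_seconds: float, max_eval_fraction: float = 0.1) -> EvalRung:
    """Return the most accurate rung whose eval time fits the overhead budget.

    Hypothetical policy: eval may cost at most `max_eval_fraction` of the
    training time. If nothing fits, fall back to the cheapest rung.
    """
    budget = train_seconds * max_eval_fraction
    affordable = [r for r in RUNGS if r.seconds <= budget]
    if not affordable:
        return min(RUNGS, key=lambda r: r.seconds)
    return min(affordable, key=lambda r: r.abs_bpb_error)
```

Under these assumptions a 60s run stays on `cheap`, a 5-minute run can afford `reference`, and an hour-long run can afford `full`.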

2. I optimized the real MLX bottlenecks, not just model code

A lot of the meaningful gains came from tightening the training/data/runtime system around the model:

token cache path: roughly +2% to +4% throughput depending on preset
prepacked train path: roughly +5% to +6% tokens in matched 60s runs on m5-fast / m5-balanced
optimizer hot-path cleanup: about +3.8% on m5-large
streamed gradient accumulation: about -20.7% peak memory on m5-large, with lower first-step latency and flat steady-state throughput

Those changes matter because they improve the calibration/search loop itself, not just one benchmark. They also have no effect on equal-step validation quality, so they are pure efficiency wins.
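
The post does not spell out what "streamed" means, so the following is only one plausible reading: fold each micro-batch gradient into a running sum as soon as it is produced, instead of holding every micro-batch gradient until the end. The sketch uses plain Python lists in place of MLX arrays, and both function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Grad = List[float]


def accumulate_buffered(batches: Sequence[object],
                        grad_fn: Callable[[object], Grad]) -> Grad:
    """Keep every micro-batch gradient alive, then average them."""
    if not batches:
        raise ValueError("need at least one micro-batch")
    grads = [grad_fn(b) for b in batches]  # all gradients live at once
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def accumulate_streamed(batches: Sequence[object],
                        grad_fn: Callable[[object], Grad]) -> Grad:
    """Fold each gradient into a running sum; only two buffers are ever live."""
    total: Optional[Grad] = None
    count = 0
    for batch in batches:
        g = grad_fn(batch)
        total = g if total is None else [t + x for t, x in zip(total, g)]
        count += 1
    if total is None:
        raise ValueError("need at least one micro-batch")
    return [t / count for t in total]
```

Both functions return the same averaged gradient, which is consistent with the claim that equal-step validation quality is unaffected; the difference is only in how many gradient buffers exist at the peak.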

3. The runtime is starting to make safer decisions on its own

A big practical problem on local hardware is that the “right” settings are not stable forever.

If you:

move to a different machine
change the model enough
change the runtime enough
or keep using old measurements long after the system changed

then the old calibration can stop being trustworthy.

So instead of just hardcoding one set of local defaults and hoping they stay good, the runtime is starting to behave more like this:

if it recognizes the machine and the preset shape, it can use measured local calibration
if the shape changed a lot, or the machine is different, it falls back more conservatively
if the evidence behind a calibration is thin or stale, it avoids over-trusting it

The point is simple: a user should not have to know the entire calibration history of the repo to get sane behavior. The system should increasingly know when it is on familiar ground and when it should be cautious.
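
As a rough illustration of that behavior, here is a sketch of a trust check. The 30- and 90-day thresholds echo the "30/90-day aging" mentioned in the instrumentation comparison later in this thread, but how the repo actually applies them is an assumption here, as are the `Calibration` fields, the `min_samples` rule, and the three level names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Calibration:
    hardware_key: str      # e.g. "apple-m5-32gb-10gpu"
    preset_signature: str  # hypothetical identifier for the preset shape
    measured_on: date
    sample_count: int      # how many measurements back this calibration


def trust_level(cal: Calibration, hardware_key: str, preset_signature: str,
                today: date, fresh_days: int = 30, stale_days: int = 90,
                min_samples: int = 3) -> str:
    """Classify a stored calibration as 'trusted', 'cautious', or 'fallback'."""
    # Different machine or different preset shape: do not reuse the numbers.
    if cal.hardware_key != hardware_key or cal.preset_signature != preset_signature:
        return "fallback"
    age_days = (today - cal.measured_on).days
    # Thin or stale evidence: also fall back to conservative defaults.
    if age_days > stale_days or cal.sample_count < min_samples:
        return "fallback"
    # Aging but not stale: usable, with wider safety margins.
    if age_days > fresh_days:
        return "cautious"
    return "trusted"
```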

4. There is now a one-button bring-up path

This is probably the most important architectural shift.

There is now a bring-up flow that can:

fingerprint the machine
sweep practical preset families
find a candidate operating zone
calibrate eval behavior on that machine
emit a report and a candidate new default for the autoresearch stage on that hardware

So the story is becoming:

clone on new hardware
run one calibration command
get lower / recommended / upper / reference zones
get a proposed default backed by measurements
compare that result to the checked-in M5 reference and the upstream-style reference

That is a much bigger step toward a hardware-agnostic autoresearch platform than “here is a Metal fork that runs.”
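
To make the flow concrete, here is a small sketch of two of its pieces: building a hardware key in the `apple-m5-32gb-10gpu` style, and turning sweep results into lower / recommended / upper zones. The `SweepPoint` type, the use of `depth` as the sweep axis, and the tolerance rule are all assumptions for illustration; the real tool may define zones differently, and the "reference" zone is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List


def hardware_key(vendor: str, chip: str, memory_gb: int, gpu_cores: int) -> str:
    """Format a key like 'apple-m5-32gb-10gpu' from fingerprint fields."""
    return f"{vendor}-{chip}-{memory_gb}gb-{gpu_cores}gpu".lower()


@dataclass(frozen=True)
class SweepPoint:
    preset: str
    depth: int       # hypothetical size axis of the preset family
    val_bpb: float   # measured validation BPB at a fixed time budget


def operating_zone(points: List[SweepPoint], tolerance: float = 0.01) -> Dict[str, str]:
    """Pick the best preset plus the smallest and largest presets near it.

    'Near' means within `tolerance` BPB of the best measured point.
    """
    if not points:
        raise ValueError("empty sweep")
    ordered = sorted(points, key=lambda p: p.depth)
    best = min(ordered, key=lambda p: p.val_bpb)
    near = [p for p in ordered if p.val_bpb - best.val_bpb <= tolerance]
    return {
        "lower": near[0].preset,
        "recommended": best.preset,
        "upper": near[-1].preset,
    }
```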

5. The same machinery now covers post-change revalidation

Another big shift is that calibration is no longer only about new hardware.

If I change something substantial in the training stack:

SwiGLU
attention implementation
compile strategy
batching behavior
checkpointing behavior
eval semantics

then the question is no longer just “does training still run?” It is also:

did the best operating point move?
did old eval calibrations become stale?
should the platform default change?

So the calibration stack is now starting to handle both:

new hardware bring-up
same hardware, but meaningful autoresearch changes

That is a necessary step if the longer-term goal is a platform that can grow across backends without every variant turning into a manually maintained fork.
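
One simple way to detect that "something substantial changed" is to hash the parts of the training stack that affect calibration and compare against the signature stored with the last calibration. The sketch below assumes that approach; the config keys and function names are hypothetical, not the repo's schema.

```python
import hashlib
import json
from typing import Any, Dict


def stack_signature(config: Dict[str, Any]) -> str:
    """Return a short, order-independent hash of the training-stack config."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def needs_revalidation(stored_signature: str, current_config: Dict[str, Any]) -> bool:
    """True when the current stack no longer matches the calibrated one."""
    return stack_signature(current_config) != stored_signature
```

For example, switching a hypothetical `"mlp"` field from one activation to `"swiglu"` changes the signature, which would mark the old eval calibration as stale and trigger a re-sweep on the same hardware.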

Summary

So the meaningful gains here are not just raw speedups.

The bigger shift is that the project is moving from:

“an MLX port of autoresearch”

toward:

“a calibrated research platform that happens to start on MLX”

The abstractions are increasingly about:

eval fidelity vs speed tradeoffs
operating-point search
calibration confidence and drift
promotion of measured defaults
one-button bring-up on unfamiliar hardware

It is still MLX-first today, but the foundation is starting to look much more platform-extensible than backend-specific.

Entrpi · 2026-03-13T22:33:53Z

Instrumentation Comparison: autoresearch vs autoresearch-everywhere

| Feature Area | `autoresearch` (karpathy) | `autoresearch-everywhere` |
| --- | --- | --- |
| Step-level metrics | 8 live metrics via `\r` print (loss, lr, dt, tok/sec, mfu, epoch, remaining, pct_done) | `StepTiming` dataclass with fine-grained breakdown (loader, grad, accumulate, optimizer, other) + separate warmup vs steady-state tracking |
| Timing | Simple `t0`/`t1` per step, startup time, total time | Per-component time decomposition with percentage breakdowns and median-filter-based steady-state window detection |
| FLOPs / Throughput | MFU against H100 peak; steady-state MFU (skips first 10 steps) | MFU + TFLOPS; per-component overhead percentages; steady-state windowing with configurable tolerance |
| Memory | `torch.cuda.max_memory_allocated()` peak VRAM | `peak_vram_mb` via `ProbeResult` + full `HardwareFingerprint` (cores, arch, compute capability, driver version, etc.) |
| Evaluation | `evaluate_bpb()`: fixed BPB metric, single rung | Three-rung eval policy (cheap/reference/full) with calibration, freshness tracking (30/90-day aging), signature matching, and `AutoEvalDecision` metadata |
| Eval telemetry | None | Dedicated `EvalTelemetryRecord` system to JSONL; `EvalTelemetrySummary` with median, relative MAD, and stability detection |
| Hardware detection | Assumes H100 (hardcoded peak) | Auto-detects platform (Apple Silicon / NVIDIA); generates hardware keys like `apple-m5-32gb-10gpu`; full `HardwareFingerprint` |
| Experiment tracking | TSV-based `results.tsv` (commit, val_bpb, memory, status, description) | Evidence ledger (JSONL) with typed events (verify, capture, trace, auto-review, deep-profile, integration-AB), timestamps, git commits, metric values, and evidence-bonus scoring |
| GPU tracing | None | Metal capture tracing (MLX) and NVIDIA Nsight/NCU profiling (CUDA) with kernel categorization (17 categories), relevance scoring, trace metadata |
| Deep profiling | None | `LabDeepProfileResult` with diagnosis (compute-bound, bandwidth-bound, under-occupied, mixed) and confidence scoring |
| Kernel-level analysis | None | `profile_mlx_targets()` with priority scoring (seq_factor, model_factor, repeat sites, evidence bonus); ranked candidate lists with rationales |
| A/B testing | None | `LabIntegrationABResult` for kernel-vs-kernel comparisons with delta metrics and measured pair counts |
| Curve projection | None | Power-law curve fitting with R², sigma, Monte Carlo sampling (20k samples), margin-to-best ranking |
| Platform calibration | None | Multi-stage calibration tool with fast/full modes, plateau detection, horizon reporting (60s–900s) |
| Checkpointing | None | `AsyncCheckpointWriter` with threading, error propagation, timing, and metadata snapshots |
| Failure detection | NaN / loss > 100 → exit(1) | Structured error tails in `ProbeResult`; crash events in evidence ledger |
| Post-hoc analysis | Jupyter notebook (`analysis.ipynb`) with cumulative improvement plots | Curve projection summaries and evidence ledger queries |
| Logging framework | Print statements only | Structured dataclasses to JSONL files; no external logging framework either, but much richer schema |
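
For readers unfamiliar with the "median, relative MAD, and stability detection" entry in the eval telemetry row, here is a minimal sketch of those statistics. It is not the `EvalTelemetrySummary` implementation; the function name, the returned keys, and the 5% stability threshold are assumptions.

```python
from statistics import median
from typing import Dict, Sequence


def summarize_eval_times(seconds: Sequence[float],
                         max_relative_mad: float = 0.05) -> Dict[str, object]:
    """Summarize repeated eval timings with outlier-robust statistics.

    Relative MAD is the median absolute deviation divided by the median;
    the series counts as stable when that ratio is at or below the threshold.
    """
    if not seconds:
        raise ValueError("need at least one timing")
    med = median(seconds)
    mad = median([abs(s - med) for s in seconds])
    relative_mad = mad / med if med else float("inf")
    return {
        "median": med,
        "relative_mad": relative_mad,
        "stable": relative_mad <= max_relative_mad,
    }
```

Median and MAD are used instead of mean and standard deviation so that a single slow run (for example, one hit by thermal throttling) does not distort the summary.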

Summary

autoresearch has a lean, sufficient instrumentation layer: live training metrics, simple timing, MFU, a TSV experiment log, and a Jupyter notebook. It's designed for a single-GPU (H100) automated research loop.

autoresearch-everywhere massively expands on this with:

Multi-platform awareness (Apple Silicon + NVIDIA) with hardware fingerprinting
Multi-rung evaluation with calibration, freshness, and confidence tracking
GPU trace capture and kernel-level profiling (Metal + Nsight)
An evidence ledger replacing the simple TSV, supporting typed events and bonus scoring
Curve projection for extrapolating and comparing runs under uncertainty
A/B integration testing for kernel comparisons
Deep profiling with bottleneck diagnosis

Essentially, autoresearch-everywhere turns what was a simple "print metrics and log to TSV" system into a full observability and decision-support platform for cross-hardware kernel optimization.
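
The curve projection entry mentions power-law fitting with R². As a hedged illustration of that one piece, the sketch below fits `y = a * x**b` by least squares in log-log space. The exact functional form the repo fits is not stated in this thread, so this form is an assumption, and the sigma estimate and Monte Carlo sampling are not covered.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = a * x**b on positive data; return (a, b, r_squared).

    R² is computed in log space, where the fit is an ordinary linear regression.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    if sxx == 0.0:
        raise ValueError("x values must not all be equal")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), b, r_squared
```

A fit like this lets two runs be compared at a horizon neither has reached yet, with R² as a quick check on whether the extrapolation deserves any trust.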

Entrpi · 2026-03-14T03:33:44Z

I've now tested bring-up on a DGX Spark / GB10 CUDA system with FlashAttention 4. Calibration there should give you a baseline model at around 1.16 val_bpb after 5 minutes of training, which you can use as the starting point for the autoresearch loop.
