Deeply Expanded Capabilities (CUDA + MLX, tested with Blackwell and FA4 + Apple M5) #184
Replies: 3 comments
Quick update on where this has gone since I opened this. The biggest shift is that this is no longer just “a refined MLX port.” It has turned into a measured platform-calibration stack that starts on MLX/Metal, but is increasingly organized around a more general problem:
A few changes were especially important.
1. Eval efficiency became a primary focus
On local MLX hardware, eval became much heavier. It could dominate the wall clock of the whole autoresearch loop, so it directly affects:
The important part is that I did not just make eval cheaper. I grounded it against a full upstream-shaped target and then measured the best local tradeoff against that target. Concretely:
From there I built a rung ladder:
And I measured the error/time tradeoff against the full upstream-style baseline instead of guessing:
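As a minimal sketch of that kind of measurement (not the repo's actual code): given a metric and a wall-clock time for each rung, plus the metric from the full upstream-style baseline, the selection step picks the cheapest rung whose error stays inside a tolerance. The `Rung` class, `pick_rung`, the field names, and the tolerance are all hypothetical, and the numbers in the example are made-up placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rung:
    """One eval configuration on the ladder (hypothetical structure)."""
    name: str            # label for the rung
    eval_seconds: float  # measured wall-clock cost of running this rung
    metric: float        # metric value this rung reports


def pick_rung(rungs: List[Rung], baseline_metric: float,
              max_abs_error: float) -> Optional[Rung]:
    """Return the cheapest rung within max_abs_error of the baseline.

    Returns None when no rung is close enough, so the caller can fall
    back to the full eval instead of trusting a poor proxy.
    """
    close_enough = [
        r for r in rungs
        if abs(r.metric - baseline_metric) <= max_abs_error
    ]
    if not close_enough:
        return None
    return min(close_enough, key=lambda r: r.eval_seconds)


# Placeholder numbers for illustration only.
example_rungs = [
    Rung("tiny", eval_seconds=5.0, metric=1.10),
    Rung("small", eval_seconds=20.0, metric=1.02),
    Rung("full", eval_seconds=120.0, metric=1.00),
]
chosen = pick_rung(example_rungs, baseline_metric=1.00, max_abs_error=0.03)
print(chosen.name if chosen else "no rung within tolerance")
```

Keeping the baseline as an explicit input is the point of the design: the proxy is always judged against the full metric rather than against itself.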
That is the real reason eval efficiency became such a major focus: the goal was to find the best grounded local proxy for the full upstream metric, not to invent a different easier metric. The runtime can now choose between
2. I optimized the real MLX bottlenecks, not just model code
A lot of the meaningful gains came from tightening the training/data/runtime system around the model:
Those changes matter because they improve the calibration/search loop itself, not just one benchmark. They also have no effect on equal-step validation quality, so they are pure efficiency/speed wins.
3. The runtime is starting to make safer decisions on its own
A big practical problem on local hardware is that the “right” settings are not stable forever. If you:
then the old calibration can stop being trustworthy. So instead of just hardcoding one set of local defaults and hoping they stay good, the runtime is starting to behave more like this:
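One way to picture that behavior, as a sketch under stated assumptions rather than the repo's actual logic: store a fingerprint of the environment alongside each calibration, and only trust the saved settings when the current fingerprint matches. The function names, the record layout, and the environment keys used below are hypothetical; the real triggers for distrust may differ.

```python
import hashlib
import json
from typing import Optional, Tuple


def fingerprint(env: dict) -> str:
    """Hash a description of the environment into a stable string.

    sort_keys makes the hash independent of dict insertion order.
    """
    blob = json.dumps(env, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def choose_settings(saved: Optional[dict], current_env: dict,
                    conservative: dict) -> Tuple[dict, bool]:
    """Return (settings, trusted).

    `saved` is a hypothetical calibration record shaped like
    {"fingerprint": str, "settings": dict}. If it is missing or was
    recorded under a different environment, fall back to conservative
    defaults and report that the result is not trusted.
    """
    if saved is not None and saved.get("fingerprint") == fingerprint(current_env):
        return saved["settings"], True
    return conservative, False


# Illustrative values only; keys and values are assumptions.
env_then = {"chip": "example-chip", "framework": "1.0"}
env_now = {"chip": "example-chip", "framework": "1.1"}
record = {"fingerprint": fingerprint(env_then), "settings": {"batch_size": 32}}
safe_defaults = {"batch_size": 8}

settings, trusted = choose_settings(record, env_now, safe_defaults)
print(settings, trusted)
```

Returning the `trusted` flag separately lets the caller decide whether to schedule a recalibration instead of silently running on stale settings.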
The point is simple: a user should not have to know the entire calibration history of the repo to get sane behavior. The system should increasingly know when it is on familiar ground and when it should be cautious.
4. There is now a one-button bring-up path
This is probably the most important architectural shift. There is now a bring-up flow that can:
So the story is becoming:
That is a much bigger step toward a hardware-agnostic autoresearch platform than “here is a Metal fork that runs.”
5. The same machinery now covers post-change revalidation
Another big shift is that calibration is no longer only about new hardware. If I change something substantial in the training stack:
then the question is no longer just “does training still run?” It is also:
So the calibration stack is now starting to handle both:
That is a necessary step if the longer-term goal is a platform that can grow across backends without every variant turning into a manually maintained fork.
Summary
So the meaningful gains here are not just raw speedups. The bigger shift is that the project is moving from:
toward:
The abstractions are increasingly about:
It is still MLX-first today, but the foundation is starting to look much more platform-extensible than backend-specific.
Instrumentation Comparison: autoresearch vs autoresearch-everywhere
Summary
Essentially,
I've now tested bring-up on a DGX Spark / GB10 CUDA system with FlashAttention 4, and the calibration should give you a model baseline around
I love this. I was working on something similar when the project dropped, and I've pivoted to extending this into a more platform-extensible version. I started by making a refined MLX port and have focused primarily on optimizations that will make it easier to add further GPU training backends.
I have also integrated a notion called "Autonomy Golf", which is how I've been driving the development cycle. You can apply it to any project you want to automate more: it is both a scoring system for how fully automated your project is and a process your agent can adopt to improve insight into the development cycle.
https://github.com/Entrpi/autoresearch-everywhere