RTX 4070 Laptop GPU + WSL2 run — 282 experiments, 2.770 → 2.495 #448
radozaprazny
started this conversation in
Show and tell
Replies: 1 comment
-
|
Great results on a laptop GPU! The torch.compile overhead fix is really practical. And the depth 6 vs 9 finding is interesting. It shows how much the optimal config shifts on consumer hardware, where you get fewer steps in the same time budget, so a smaller model with a healthier LR schedule actually wins. I've been thinking about this problem from a different angle. I built a platform where you can run experiments on cloud GPUs without owning one, funded by anyone who thinks the experiment is worth running. Wrote a longer post here: #452 |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Hardware: ASUS ROG NUC, RTX 4070 Laptop GPU (8 GB VRAM), Windows 11 + WSL2, CUDA 13.1.
Results: 282 experiments, 38 keeps, val_bpb 2.770532 → 2.495532 (~10% improvement).
Full session report with config tables, progress curve, and per-experiment breakdown:
👉 https://github.com/radozaprazny/autoresearch
Key findings:
torch.compile fix: ~600s overhead eats the whole 5-min budget — recovered by excluding first 10 steps from the timer (step > 10 guard). Critical for consumer GPUs.
Optimal depth is 6, not 9 — at ~2000 steps the LR schedule degenerates with almost the entire run in cooldown
Token shift K-only 1/4 channels confirmed (−0.021 bpb); GB10's optimal was 1/8 — platform-specific sweet spot
Separate WD param groups from #43 confirmed (−0.001 bpb)
WSL2: zero code changes needed, upstream runs out of the box
Beta Was this translation helpful? Give feedback.
All reactions