RTX 4070 Laptop GPU + WSL2 run — 282 experiments, 2.770 → 2.495 #448

radozaprazny · 2026-03-30T13:28:52Z

Hardware: ASUS ROG NUC, RTX 4070 Laptop GPU (8 GB VRAM), Windows 11 + WSL2, CUDA 13.1.

Results: 282 experiments, 38 keeps, val_bpb 2.770532 → 2.495532 (~10% improvement).

Full session report with config tables, progress curve, and per-experiment breakdown:
👉 https://github.com/radozaprazny/autoresearch

Key findings:

torch.compile fix: compilation adds ~600s of overhead, which eats the whole 5-min budget. Recovered by excluding the first 10 steps from the timer (step > 10 guard). Critical for consumer GPUs.
Optimal depth is 6, not 9: at ~2000 steps the LR schedule degenerates, with almost the entire run spent in cooldown.
Token shift K-only on 1/4 of channels confirmed (−0.021 bpb). GB10's optimum was 1/8, so the sweet spot is platform-specific.
Separate WD param groups from #43 confirmed (−0.001 bpb).
WSL2: zero code changes needed; upstream runs out of the box.
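
For anyone who wants to try the timer guard, here is a minimal sketch of the idea, not the actual patch. The names run_with_budget, train_step and clock are hypothetical and do not come from the repo; the only details taken from the post are the 10 excluded steps and the 5-minute budget.

```python
import time

WARMUP_STEPS = 10       # from the post: first 10 steps are excluded from the timer
TIME_BUDGET_S = 300.0   # from the post: 5-minute training budget


def run_with_budget(train_step, budget_s=TIME_BUDGET_S,
                    warmup_steps=WARMUP_STEPS, clock=time.perf_counter):
    """Call train_step(step) until budget_s of post-warmup time has elapsed.

    Time spent in the first `warmup_steps` steps (where compilation
    typically happens) is measured but not charged against the budget.
    Returns (steps_run, timed_seconds).
    """
    timed = 0.0
    step = 0
    while timed < budget_s:
        t0 = clock()
        train_step(step)
        dt = clock() - t0
        step += 1
        if step > warmup_steps:   # the "step > 10" guard
            timed += dt
    return step, timed
```

With a guard like this, slow compile steps still cost wall-clock time, but the full budget remains available for actual training steps.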

nblintao · 2026-03-31T01:10:38Z

Great results on a laptop GPU! The torch.compile overhead fix is really practical. And the depth 6 vs 9 finding is interesting. It shows how much the optimal config shifts on consumer hardware, where you get fewer steps in the same time budget, so a smaller model with a healthier LR schedule actually wins.
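
To make the step-count effect concrete, here is a rough sketch of one way a schedule can end up mostly in cooldown. It assumes a trapezoidal schedule (linear warmup, flat, linear cooldown) whose cooldown length is fixed in steps; that is an assumption for illustration, and the schedule actually used in the run may be defined differently. The function names and the numbers 100 and 1800 are hypothetical.

```python
def lr_multiplier(step, total_steps, warmup_steps, cooldown_steps):
    """LR scale factor for a trapezoidal schedule (assumed shape)."""
    cooldown_start = max(total_steps - cooldown_steps, warmup_steps)
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup
    if step < cooldown_start:
        return 1.0                                # flat phase
    remaining = max(total_steps - cooldown_start, 1)
    return max(total_steps - step, 0) / remaining  # linear cooldown to 0


def cooldown_fraction(total_steps, warmup_steps, cooldown_steps):
    """Share of the run spent in the cooldown phase."""
    cooldown_start = max(total_steps - cooldown_steps, warmup_steps)
    return (total_steps - cooldown_start) / total_steps


# Hypothetical numbers: the same 1800-step cooldown in a short and a long run.
short_run = cooldown_fraction(2000, warmup_steps=100, cooldown_steps=1800)
long_run = cooldown_fraction(6000, warmup_steps=100, cooldown_steps=1800)
```

Under these assumed numbers the 2000-step run spends 90% of its steps decaying the LR, while the 6000-step run spends 30%. This is only meant to illustrate why a model that fits more steps into the same time budget can end up with a healthier schedule.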

I've been thinking about this problem from a different angle. I built a platform where you can run experiments on cloud GPUs without owning one, funded by anyone who thinks the experiment is worth running. Wrote a longer post here: #452
