Lambda out of capacity #617
Yeah, they're really hard to come by these days 😞 I've been training on a 1x A100 though; the speedrun takes about 14h. I'd be interested to hear what issues you ran into.
Same as you, I used the gpu_8x_a100_80gb_sxm4 instance on Lambda Cloud. When I ran bash runs/speedrun.sh, it failed in the base_train.py step with the error: "ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')". GPT suggested running export TORCH_COMPILE_DISABLE=1 before the command, but I haven't had access to a new server to test this yet, so I don't know whether other errors will pop up.
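For anyone who wants to check a machine before kicking off a long run, here is a rough sketch of a pre-flight check. It assumes (I have not verified this against the Triton source) that the fp8e4nv type named in the error needs CUDA compute capability 8.9 or newer, which would explain the failure on A100 (8.0) and why H100 (9.0) is fine. The helper names supports_fp8e4nv and local_gpu_capability are made up for this example and are not part of the repo.

```python
# Assumption: Triton's fp8e4nv (float8 e4m3) type needs CUDA compute
# capability 8.9 or newer. A100 reports 8.0, H100 reports 9.0.
FP8E4NV_MIN_CAPABILITY = (8, 9)


def supports_fp8e4nv(capability):
    """Return True if a (major, minor) capability meets the assumed minimum."""
    return tuple(capability) >= FP8E4NV_MIN_CAPABILITY


def local_gpu_capability():
    """Return the first GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA GPU (or no PyTorch) detected")
    else:
        print(f"compute capability {cap}: fp8e4nv supported = {supports_fp8e4nv(cap)}")
```

If this reports False, the fp8 path is the likely culprit, and the fix is probably to turn fp8 off for that run rather than to change anything about the GPU setup.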
Hi! Is there a guide to config policies by compute? I.e., one that answers which parameters (depth, other size params, window pattern, etc.) I should use for a given compute budget? Or, even better :) an auto-optimizing branch?
I can't get 8x H100s on Lambda, and attempting to train on 8x A100 didn't work out (perhaps I'll post about that separately) -- at any rate, right now even those are not available. I'm wondering what alternatives people have used.