Lambda out of capacity #617

pradiptamitra · 2026-03-10T02:21:57Z

pradiptamitra
Mar 10, 2026

I can't get 8x H100s on Lambda, and attempting to train on 8x A100 didn't work out (perhaps I'll post about that separately). At any rate, right now even those are not available. I'm wondering what alternatives people have used.

svlandeg · 2026-03-10T08:47:58Z

svlandeg
Mar 10, 2026
Collaborator

Yea, they're really hard to come by these days 😞

I've been training on a 1x A100 though, speedrun takes about 14h. Would be interested to hear what issues you ran into?

6 replies

kschwethelm Mar 10, 2026

Hey, the code actually also works with FA2. This worked for me:

1. Add flash attention wheels to [tool.uv.sources] in pyproject.toml

# target torch to cuda 12.8 or CPU
[tool.uv.sources]
torch = [
    { index = "pytorch-cpu", extra = "cpu" },
    { index = "pytorch-cu128", extra = "gpu" },
]
flash-attn = { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl" }

...and then add "flash-attn" to the dependencies = [...] list in the same file. For example:

dependencies = [
    "flash-attn",
    "datasets>=4.0.0",
    "fastapi>=0.117.1",
    ...
]

2. Load FA2 (flash-attn) in nanochat/flash_attention.py
The simplest hack is just to replace this:

nanochat/nanochat/flash_attention.py

Lines 23 to 38 in f068604

    
def _load_flash_attention_3():
    """Try to load Flash Attention 3 (requires Hopper GPU, sm90)."""
    if not torch.cuda.is_available():
        return None
    try:
        major, _ = torch.cuda.get_device_capability()
        # FA3 kernels are compiled for Hopper (sm90) only
        # Ada (sm89), Blackwell (sm100) need SDPA fallback until FA3 is recompiled
        if major != 9:
            return None
        import os
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
        from kernels import get_kernel
        return get_kernel('varunneal/flash-attention-3').flash_attn_interface
    except Exception:
        return None

with this:

def _load_flash_attention_3():
    """Load Flash Attention 2"""
    if not torch.cuda.is_available():
        return None
    try:
        # Removed compatibility check for simplicity
        import flash_attn
        return flash_attn
    except Exception:
        return None

You could also integrate this more nicely with an automatic check :)
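One way such an automatic check could look, as a rough sketch rather than the repo's actual code: pick a backend from the CUDA compute capability and from whether the `flash_attn` package is importable. The name `pick_attention_backend` and the string labels are made up for illustration. Limiting FA2 to sm8x (Ampere and Ada) is a deliberately conservative assumption, and the installed wheel still has to match your torch/CUDA build.

```python
import importlib.util


def pick_attention_backend(capability, fa2_installed=None):
    """Return "fa3", "fa2" or "sdpa" for a (major, minor) compute capability.

    Hypothetical helper, not part of nanochat. Pass capability=None when no
    CUDA device is present.
    """
    if capability is None:
        return "sdpa"
    if fa2_installed is None:
        # find_spec returns None when a top-level package cannot be found
        fa2_installed = importlib.util.find_spec("flash_attn") is not None
    major, _minor = capability
    if major == 9:
        # Hopper (sm90): the FA3 kernels are compiled for this architecture
        return "fa3"
    if major == 8 and fa2_installed:
        # Assumption: FA2 is used only on Ampere/Ada (sm80, sm86, sm89) here;
        # every other architecture stays on the SDPA fallback
        return "fa2"
    return "sdpa"


print(pick_attention_backend((8, 0), fa2_installed=True))   # A100
print(pick_attention_backend((9, 0), fa2_installed=False))  # H100
print(pick_attention_backend((7, 5), fa2_installed=True))   # Turing
```

Inside flash_attention.py the capability tuple would come from torch.cuda.get_device_capability(), which the existing loader already calls.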

svlandeg Mar 10, 2026
Collaborator

~~This surprises me, the current code has~~

~~repo = "varunneal/flash-attention-3" if cap == (9, 0) else "kernels-community/flash-attn3"~~

~~and the community kernel should work just fine on an A100, I ran it before.~~

svlandeg Mar 10, 2026
Collaborator

To reduce memory, try training a smaller model (decrease depth) & batch-size.

pradiptamitra Mar 11, 2026
Author

I went off on an unproductive (and costly!) tangent when I noticed that flash attention wasn't working with the A100s.

> This surprises me, the current code has
> repo = "varunneal/flash-attention-3" if cap == (9, 0) else "kernels-community/flash-attn3"
> and the community kernel should work just fine on an A100, I ran it before.

Not sure what happened, but here are the relevant log lines showing both the architecture and the failure.

9912-Autodetected device type: cuda
9913-2026-03-08 05:05:04,059 - nanochat.common - INFO - Distributed world size: 8
9914:GPU: NVIDIA A100-SXM4-40GB | Peak FLOPS (BF16): 3.12e+14
9915:COMPUTE_DTYPE: torch.bfloat16 (auto-detected: CUDA SM 80 (bf16 supported))
9916-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
9917-WARNING: Flash Attention 3 not available, using PyTorch SDPA fallback
9918-WARNING: Training will be less efficient without FA3
9919-WARNING: SDPA has no support for sliding window attention (window_pattern='SSSL'). Your GPU utilization will be terrible.

It was actually this that made me think the OOMs were caused by FA not being available rather than by the other stuff. I'll try the suggestions with... vast.ai maybe.

svlandeg Mar 11, 2026
Collaborator

No, sorry, I was mixing up this repo (nanochat) with autoresearch; only the latter has the community FA3 kernel for A100.

Anyway so ignore my previous message.

What you see here is OK by itself; you can run the SDPA fallback. But you'll want to set your window pattern to "L", so something like:

python -m scripts.base_train --depth=12 --target-param-data-ratio=9.5 --device-batch-size=16 --window-pattern="L" --run=$WANDB_RUN

tongyan160-ux · 2026-03-10T08:52:10Z

tongyan160-ux
Mar 10, 2026

Same as you, I used the gpu_8x_a100_80gb_sxm4 instance on Lambda Cloud. When I ran bash runs/speedrun.sh, it failed in the base_train.py section with the error: "ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')". GPT suggested trying export TORCH_COMPILE_DISABLE=1 before running the command, but I haven't had access to a new server to test this yet, so I'm not sure if other errors will pop up.

3 replies

svlandeg Mar 10, 2026
Collaborator

Ah! Yes, that makes sense. In speedrun on an A100, remove --fp8 from the base train command. You'll also want to lower the depth. Try a d12 to start (--depth=12); I think you should be able to go up to a d18. You'll also want to set --window-pattern="L".

So something like this:

python -m scripts.base_train --depth=12 --target-param-data-ratio=9.5 --device-batch-size=16 --window-pattern="L" --run=$WANDB_RUN

(or use torchrun if you have 8x)
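The adjustments above could be collected in a small helper that assembles the command from the GPU's compute capability. This is only a sketch: `build_base_train_cmd` is a hypothetical name, the flags are the ones quoted in this thread, and the torchrun form (with `--` before the script arguments) is an assumption about how speedrun.sh launches multi-GPU runs. The rule "FP8 only on Hopper, window pattern L everywhere else" just encodes the advice given here.

```python
def build_base_train_cmd(capability, run_name, depth=12,
                         device_batch_size=16, nproc=1):
    """Assemble a scripts.base_train command as a list of arguments.

    Hypothetical helper: capability is a (major, minor) tuple, run_name is
    whatever you would otherwise pass via $WANDB_RUN.
    """
    is_hopper = capability[0] == 9
    args = [
        f"--depth={depth}",
        "--target-param-data-ratio=9.5",
        f"--device-batch-size={device_batch_size}",
        f"--run={run_name}",
    ]
    if is_hopper:
        # FP8 is only kept on Hopper; on an A100 it fails with the
        # "fp8e4nv not supported" error quoted above
        args.append("--fp8")
    else:
        # Without FA3 the SDPA fallback has no sliding-window support
        args.append("--window-pattern=L")
    if nproc > 1:
        launcher = ["torchrun", "--standalone",
                    f"--nproc_per_node={nproc}",
                    "-m", "scripts.base_train", "--"]
    else:
        launcher = ["python", "-m", "scripts.base_train"]
    return launcher + args


# "my-a100-run" is a placeholder run name
print(" ".join(build_base_train_cmd((8, 0), run_name="my-a100-run")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues.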

pradiptamitra Mar 10, 2026
Author

What about the stages after base train (SFT and so forth)? Any changes you needed there?

svlandeg Mar 10, 2026
Collaborator

Not compared to the current code in speedrun.sh, no. You can even remove the --device-batch-size=16 arg from the chat_sft call, as this is now inherited from the setting you used for base_train (cf also my PR here). So if you want to change this setting, you wouldn't have to do it twice.

christophergeyer · 2026-03-10T13:12:05Z

christophergeyer
Mar 10, 2026

Hi! Is there a guide to config policies by compute? I.e., that answers what parameters (depth, other size params, window pattern, etc.) should I use under a given compute? Or, even better :) an auto-optimizing branch?

1 reply

svlandeg Mar 10, 2026
Collaborator

You could start with runcpu.sh first, see that it all runs, then shift to speedrun.sh. Most parameters adjust themselves automatically depending on the depth of the model you require. Start with a super small one (d6 or so) first to see whether it goes through training properly, then go up.

There are a few settings you need to adjust depending on your system, like removing --fp8 or setting the window-pattern.

Having a separate script in the repo for all possible systems would probably be a bit much, but we can collect some best practices here? (cf also my previous post ☝️ )
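The "start small, then go up" advice can be automated in a rough way. The sketch below walks a list of depths and stops at the first one that fails. `try_depth` is a hypothetical callback standing in for whatever launches a short training run and reports whether it finished, and the depth values are just the ones mentioned in this thread.

```python
def largest_working_depth(try_depth, depths=(6, 12, 18)):
    """Return the largest depth for which try_depth(depth) succeeded.

    Depths are tried in increasing order and the search stops at the first
    failure. Returns None if even the smallest depth fails.
    """
    best = None
    for depth in sorted(depths):
        try:
            ok = bool(try_depth(depth))
        except Exception:
            # Treat any error (e.g. out of memory) as "this depth does not fit"
            ok = False
        if not ok:
            break
        best = depth
    return best


def fake_run(depth):
    # Stand-in for a real training launch: pretend anything deeper than
    # d12 runs out of memory
    if depth > 12:
        raise MemoryError("simulated OOM")
    return True


print(largest_working_depth(fake_run))  # prints 12
```

With a real launcher, a CUDA out-of-memory error in a subprocess shows up as a non-zero exit code rather than a Python exception, so `try_depth` would return False in that case.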

Replies: 4 comments · 10 replies
