vLLM 0.15.1 crashes on T4 GPU — CUTLASS DSL fails to detect sm_75 compute capability #57

@rmdodhia

Description

vLLM 0.15.1 (current latest) cannot serve Fara-7B on a T4 GPU (sm_75, 15GB VRAM) — the most common free GPU available via Google Colab. The CUTLASS DSL compiler fails to detect the T4's compute capability, causing a crash during engine initialization.

Environment

  • GPU: NVIDIA T4 (Google Colab free tier), 15GB VRAM, sm_75
  • vLLM: 0.15.1
  • PyTorch: 2.9.1+cu128
  • Python: 3.12
  • OS: Ubuntu (Colab default)

Error
(EngineCore_DP0 pid=5369) Starting to load model microsoft/Fara-7B...
(EngineCore_DP0 pid=5369) ERROR EngineCore failed to start.
File "nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/compiler.py", line 148, in compile
pm.run(module.operation)
cutlass._mlir._mlir_libs.site_initialize..MLIRError: Failure while executing pass pipeline:
error: unknown: failed to verify the compilation unit (error 7: NVVM_ERROR_INVALID_OPTION),
libNVVM extra log: libnvvm : error: -arch=compute is an unsupported option

Attempted workarounds (all failed)

  1. --enforce-eager: same CUTLASS error; it still tries to compile
  2. VLLM_USE_V1=0 (V0 engine): same error
  3. TORCH_CUDA_ARCH_LIST=7.5: same error; the env var is not picked up by the CUTLASS DSL
  4. SGLang as an alternative: cuDNN version mismatch, then OOM at FP16
  5. --max-model-len 8192 --gpu-memory-utilization 0.98: crashes before reaching model loading
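The failure can at least be anticipated before launching the engine by checking the GPU's compute capability up front. A minimal sketch follows; the supports_cutlass_dsl helper and the sm_80 threshold are illustrative assumptions, not vLLM or CUTLASS API:

```python
# Pre-flight sketch: decide whether a GPU's compute capability is new enough
# for CUTLASS-based kernel compilation. The sm_80 cutoff is an assumption
# chosen for illustration, not a documented vLLM or CUTLASS constant.
def supports_cutlass_dsl(major: int, minor: int,
                         min_cc: tuple = (8, 0)) -> bool:
    """Return True if (major, minor) meets the assumed minimum capability."""
    return (major, minor) >= min_cc

# The T4 reports compute capability 7.5 (sm_75):
print(supports_cutlass_dsl(7, 5))   # False
# An sm_80 part such as the A100 would pass:
print(supports_cutlass_dsl(8, 0))   # True
```

On a real machine the (major, minor) pair would come from torch.cuda.get_device_capability(), letting a launcher fail fast with a readable message instead of the MLIR pass-pipeline crash above.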

Impact
The T4 is the default free GPU on Google Colab, making it the most accessible option for users who:

  • Don't have a local GPU with 24GB+ VRAM
  • Can't access Azure Foundry
  • Want to try Fara-7B without hardware investment

The README recommends vLLM for self-hosting but doesn't document this incompatibility.

Suggested fixes

  • Document T4 incompatibility in the README's self-hosting section
  • Pin a known-working vLLM version (e.g., 0.8.x) in requirements or document it
  • Provide an AWQ-quantized model — FP16 Fara-7B (14.5GB weights) leaves almost no room for KV cache on T4's 15GB, even if vLLM worked. A 4-bit quantized version that preserves the vision encoder would make T4 viable.
  • Add a Colab notebook to the repo (see related issue)
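The VRAM argument above can be checked with back-of-envelope arithmetic. This sketch assumes a round 7B parameters; the reported 14.5GB checkpoint also includes the vision encoder and per-tensor overhead, so the real footprint is somewhat larger:

```python
# Rough weight-memory math for a 7B-parameter model. Estimates only:
# KV cache and activations consume additional VRAM on top of the weights.
PARAMS = 7e9
GIB = 1024 ** 3

fp16_gib = PARAMS * 2.0 / GIB   # 2 bytes per weight at FP16
int4_gib = PARAMS * 0.5 / GIB   # 0.5 bytes per weight at 4-bit (AWQ-style)

print(f"FP16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Even at roughly 13 GiB for the weights alone, FP16 leaves only about 2 GiB on a 15GB T4 for KV cache and activations, which is consistent with the OOM seen under SGLang; a 4-bit quantization at roughly 3 to 4 GiB would leave ample headroom.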
