vLLM 0.15.1 crashes on T4 GPU — CUTLASS DSL fails to detect sm_75 compute capability #57

@rmdodhia

Description

vLLM 0.15.1 (current latest) cannot serve Fara-7B on a T4 GPU (sm_75, 15GB VRAM) — the most common free GPU available via Google Colab. The CUTLASS DSL compiler fails to detect the T4's compute capability, causing a crash during engine initialization.

Environment

  • GPU: NVIDIA T4 (Google Colab free tier), 15GB VRAM, sm_75
  • vLLM: 0.15.1
  • PyTorch: 2.9.1+cu128
  • Python: 3.12
  • OS: Ubuntu (Colab default)

Error
(EngineCore_DP0 pid=5369) Starting to load model microsoft/Fara-7B...
(EngineCore_DP0 pid=5369) ERROR EngineCore failed to start.
File "nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/compiler.py", line 148, in compile
pm.run(module.operation)
cutlass._mlir._mlir_libs.site_initialize..MLIRError: Failure while executing pass pipeline:
error: unknown: failed to verify the compilation unit (error 7: NVVM_ERROR_INVALID_OPTION),
libNVVM extra log: libnvvm : error: -arch=compute is an unsupported option

Attempted workarounds (all failed)

  1. --enforce-eager: same CUTLASS error; it still tries to compile
  2. VLLM_USE_V1=0 (V0 engine): same error
  3. TORCH_CUDA_ARCH_LIST=7.5: same error; the env var is not picked up by the CUTLASS DSL
  4. SGLang as an alternative: cuDNN version mismatch, then OOM at FP16
  5. --max-model-len 8192 --gpu-memory-utilization 0.98: crashes before reaching model loading
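The failure can at least be anticipated before launching the engine by checking the GPU's compute capability up front. A minimal sketch follows; the supports_cutlass_dsl helper and the sm_80 threshold are illustrative assumptions, not vLLM or CUTLASS API:

```python
# Pre-flight sketch: decide whether a GPU's compute capability is new enough
# for CUTLASS-based kernel compilation. The sm_80 cutoff is an assumption
# chosen for illustration, not a documented vLLM or CUTLASS constant.
def supports_cutlass_dsl(major: int, minor: int,
                         min_cc: tuple = (8, 0)) -> bool:
    """Return True if (major, minor) meets the assumed minimum capability."""
    return (major, minor) >= min_cc

# The T4 reports compute capability 7.5 (sm_75):
print(supports_cutlass_dsl(7, 5))   # False
# An sm_80 part such as the A100 would pass:
print(supports_cutlass_dsl(8, 0))   # True
```

On a real machine the (major, minor) pair would come from torch.cuda.get_device_capability(), letting a launcher fail fast with a readable message instead of the MLIR pass-pipeline crash above.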

Impact
The T4 is the default free GPU on Google Colab, making it the most accessible option for users who:

  • Don't have a local GPU with 24GB+ VRAM
  • Can't access Azure Foundry
  • Want to try Fara-7B without hardware investment

The README recommends vLLM for self-hosting but doesn't document this incompatibility.

Suggested fixes

  • Document T4 incompatibility in the README's self-hosting section
  • Pin a known-working vLLM version (e.g., 0.8.x) in requirements or document it
  • Provide an AWQ-quantized model — FP16 Fara-7B (14.5GB weights) leaves almost no room for KV cache on T4's 15GB, even if vLLM worked. A 4-bit quantized version that preserves the vision encoder would make T4 viable.
  • Add a Colab notebook to the repo (see related issue)
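The VRAM argument above can be checked with back-of-envelope arithmetic. This sketch assumes a round 7B parameters; the reported 14.5GB checkpoint also includes the vision encoder and per-tensor overhead, so the real footprint is somewhat larger:

```python
# Rough weight-memory math for a 7B-parameter model. Estimates only:
# KV cache and activations consume additional VRAM on top of the weights.
PARAMS = 7e9
GIB = 1024 ** 3

fp16_gib = PARAMS * 2.0 / GIB   # 2 bytes per weight at FP16
int4_gib = PARAMS * 0.5 / GIB   # 0.5 bytes per weight at 4-bit (AWQ-style)

print(f"FP16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Even at roughly 13 GiB for the weights alone, FP16 leaves only about 2 GiB on a 15GB T4 for KV cache and activations, which is consistent with the OOM seen under SGLang; a 4-bit quantization at roughly 3 to 4 GiB would leave ample headroom.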
