Description
vLLM 0.15.1 (current latest) cannot serve Fara-7B on a T4 GPU (sm_75, 15GB VRAM) — the most common free GPU available via Google Colab. The CUTLASS DSL compiler fails to detect the T4's compute capability, causing a crash during engine initialization.
Environment
- GPU: NVIDIA T4 (Google Colab free tier), 15GB VRAM, sm_75
- vLLM: 0.15.1
- PyTorch: 2.9.1+cu128
- Python: 3.12
- OS: Ubuntu (Colab default)
Error
(EngineCore_DP0 pid=5369) Starting to load model microsoft/Fara-7B...
(EngineCore_DP0 pid=5369) ERROR EngineCore failed to start.
File "nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/compiler.py", line 148, in compile
pm.run(module.operation)
cutlass._mlir._mlir_libs.site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: unknown: failed to verify the compilation unit (error 7: NVVM_ERROR_INVALID_OPTION),
libNVVM extra log: libnvvm : error: -arch=compute is an unsupported option
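A plausible mechanism for the malformed -arch option, sketched in Python. This is an illustration only: the function name and logic are hypothetical, not CUTLASS DSL's actual code. If compute-capability detection returns nothing for sm_75, the flag is built with an empty suffix, which libNVVM then rejects as an unsupported option.

```python
def nvvm_arch_flag(capability):
    """Build the NVVM -arch option from a (major, minor) compute
    capability tuple; None models a failed detection (hypothetical)."""
    if capability is None:
        # Failed detection: empty suffix yields a malformed option
        # like the "-arch=compute" seen in the libNVVM log above.
        major = minor = ""
    else:
        major, minor = capability
    return f"-arch=compute_{major}{minor}"

print(nvvm_arch_flag((7, 5)))  # -arch=compute_75
print(nvvm_arch_flag(None))    # -arch=compute_
```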
Attempted workarounds (all failed)
- --enforce-eager: same CUTLASS error; compilation is still attempted
- VLLM_USE_V1=0 (V0 engine): same error
- TORCH_CUDA_ARCH_LIST=7.5: same error; the env var is not picked up by the CUTLASS DSL
- SGLang as an alternative: cuDNN version mismatch, then OOM at FP16
- --max-model-len 8192 --gpu-memory-utilization 0.98: crashes before model loading begins
Impact
The T4 is the default free GPU on Google Colab, making it the most accessible option for users who:
- Don't have a local GPU with 24GB+ VRAM
- Can't access Azure Foundry
- Want to try Fara-7B without hardware investment
The README recommends vLLM for self-hosting but doesn't document this incompatibility.
Suggested fixes
- Document T4 incompatibility in the README's self-hosting section
- Pin a known-working vLLM version (e.g., 0.8.x) in requirements or document it
- Provide an AWQ-quantized model. The FP16 Fara-7B weights (14.5GB) leave almost no room for KV cache in the T4's 15GB even if vLLM worked; a 4-bit quantized version that preserves the vision encoder would make the T4 viable.
- Add a Colab notebook to the repo (see related issue)
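To make the KV-cache argument concrete, a back-of-the-envelope memory budget (the parameter count and usable VRAM are rough assumptions, and the 4-bit figure ignores quantization scale/zero-point overhead):

```python
# Rough assumptions: ~7.25B parameters, ~15 GB usable VRAM on a T4.
params = 7.25e9
vram_gb = 15.0

fp16_weights_gb = params * 2 / 1e9    # 2 bytes per parameter
awq4_weights_gb = params * 0.5 / 1e9  # ~4 bits per parameter

print(f"FP16 weights:        {fp16_weights_gb:.1f} GB")          # ~14.5 GB
print(f"AWQ 4-bit weights:   {awq4_weights_gb:.1f} GB")          # ~3.6 GB
print(f"KV headroom (FP16):  {vram_gb - fp16_weights_gb:.1f} GB")
print(f"KV headroom (AWQ):   {vram_gb - awq4_weights_gb:.1f} GB")
```

Even before the CUTLASS crash, the FP16 headroom of roughly half a gigabyte makes a usable context length implausible on the T4, while 4-bit weights would leave over 11GB for KV cache and activations.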