fix(ci): unbreak rerankers (torch bump) and vllm-omni on aarch64 #9688

Merged
mudler merged 1 commit into master from fix/ci-rerankers-vllm-omni-aarch64 on May 6, 2026

Conversation

@localai-bot
Collaborator

Summary

Two independent CI breakages on master, bundled here because each fix is a one-liner.

rerankers backend tests fail (cpu / cublas12)

requirements-cpu.txt and requirements-cublas12.txt pin torch==2.4.1 but leave transformers unpinned. The latest transformers (5.x) registers a custom op in transformers/integrations/moe.py:

```python
torch.library.custom_op("transformers::grouped_mm_fallback", _grouped_mm_fallback, mutates_args=())
```

`_grouped_mm_fallback`'s signature uses string-typed annotations (`'torch.Tensor'`). torch 2.4.1's `infer_schema` does not understand them and raises `ValueError: Parameter input has unsupported type torch.Tensor`. That import error prevents the gRPC server from starting, so all 5 rerankers tests fail with `Connection refused` against `127.0.0.1:50051`.
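
A minimal sketch of the failure mode (the `demo::fallback` op name and function body are hypothetical; the behavior on each torch version is as described above):

```python
import torch

# String-typed annotations, as used by transformers 5.x's moe.py.
def _fallback(input: "torch.Tensor") -> "torch.Tensor":
    return input.clone()

# On torch 2.4.1, infer_schema cannot resolve the string annotations and
# this raises at import time:
#   ValueError: Parameter input has unsupported type torch.Tensor
# With the bumped torch==2.7.1, the op registers fine.
torch.library.custom_op("demo::fallback", _fallback, mutates_args=())
```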

Fix: bump `torch==2.4.1` → `torch==2.7.1`. This matches the pin used by the transformers backend in this repo and is the most common torch version across our Python backends. transformers stays unpinned so we keep tracking upstream.
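
For concreteness, the change in both requirements-cpu.txt and requirements-cublas12.txt is just the pin bump:

```diff
-torch==2.4.1
+torch==2.7.1
```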

vllm-omni build fails on aarch64

vllm-omni's setup.py resolves dependencies dynamically and loads requirements/cuda.txt on cuda hosts, which pins fa3-fwd==0.0.3. fa3-fwd ships only manylinux_2_24_x86_64 wheels and has no source distribution, so on aarch64 (e.g. l4t13 / SBSA cu130) uv fails with:

```
Because fa3-fwd==0.0.3 has no wheels with a matching platform tag (e.g.,
`manylinux_2_39_aarch64`) and vllm-omni==... depends on fa3-fwd==0.0.3,
we can conclude that vllm-omni cannot be used.
```

Fix: in install.sh, after cloning vllm-omni, strip fa3-fwd from requirements/cuda.txt when building on aarch64 (see the sketch after the list below). This is safe because fa3-fwd is a soft runtime dep:

  • vllm_omni/diffusion/attention/backends/utils/fa.py wraps from fa3_fwd_interface import ... in try/except ImportError and falls back through FA3 source build → FA2 (flash_attn) → vLLM's wrapper.
  • vllm_omni/diffusion/attention/backends/flash_attn.py only raises if every FA backend is missing, with a message offering SDPA as a final fallback.
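
One plausible shape for the install.sh guard — the `uname -m` check, the `sed` invocation, and the clone directory name are illustrative, not necessarily what the script does verbatim:

```bash
# Hypothetical sketch: drop fa3-fwd before resolving deps on aarch64.
# fa3-fwd ships only manylinux x86_64 wheels and has no sdist, so the
# cuda profile is unsatisfiable on Jetson/SBSA unless it is removed.
if [ "$(uname -m)" = "aarch64" ]; then
    sed -i '/^fa3-fwd/d' vllm-omni/requirements/cuda.txt
fi
```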

Test plan

  • CI green on tests-apple / tests-linux for the rerankers backend
  • aarch64 build (l4t13 / SBSA) reaches the pip install -e . step for vllm-omni without resolver failure

Assisted-by: Claude:claude-opus-4-7-1m [Claude Code]

Two unrelated CI breakages bundled together since both are one-liners:

- rerankers: bump torch 2.4.1 -> 2.7.1 on cpu/cublas12. The unpinned
  transformers resolves to 5.x, whose moe.py registers a custom_op with
  string-typed `'torch.Tensor'` annotations that torch 2.4.1's
  infer_schema rejects, blocking the gRPC server from starting and
  failing all 5 backend tests with "Connection refused" on :50051.
  Matches the version used by the transformers backend.

- vllm-omni: strip fa3-fwd from the upstream requirements/cuda.txt
  before resolving on aarch64. fa3-fwd 0.0.3 ships only an
  x86_64 wheel and has no sdist, making the cuda profile unsatisfiable
  on Jetson/SBSA. fa3-fwd is a soft runtime dep; vllm-omni's
  attention backends fall back to FA2 then SDPA when it's missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit 4e154b5 into master May 6, 2026
55 checks passed
@mudler mudler deleted the fix/ci-rerankers-vllm-omni-aarch64 branch May 6, 2026 15:07
@localai-bot localai-bot added the bug Something isn't working label May 9, 2026