fix(ci): unbreak rerankers (torch bump) and vllm-omni on aarch64 #9688
Merged
Conversation
Two unrelated CI breakages bundled together since both are one-liners:

- rerankers: bump torch 2.4.1 -> 2.7.1 on cpu/cublas12. The unpinned transformers resolves to 5.x, whose moe.py registers a custom_op with string-typed `'torch.Tensor'` annotations that torch 2.4.1's infer_schema rejects, blocking the gRPC server from starting and failing all 5 backend tests with "Connection refused" on :50051. Matches the version used by the transformers backend.
- vllm-omni: strip fa3-fwd from the upstream requirements/cuda.txt before resolving on aarch64. fa3-fwd 0.0.3 ships only an x86_64 wheel and has no sdist, making the cuda profile unsatisfiable on Jetson/SBSA. fa3-fwd is a soft runtime dep: vllm-omni's attention backends fall back to FA2 then SDPA when it's missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Summary
Two independent CI breakages on master, bundled here because each fix is a one-liner.
rerankers backend tests fail (cpu / cublas12)
`requirements-cpu.txt` and `requirements-cublas12.txt` pin `torch==2.4.1` but leave `transformers` unpinned. The latest `transformers` (5.x) registers a custom op in `transformers/integrations/moe.py`:

```python
torch.library.custom_op("transformers::grouped_mm_fallback", _grouped_mm_fallback, mutates_args=())
```
`_grouped_mm_fallback`'s signature uses string-typed annotations (`'torch.Tensor'`). torch 2.4.1's `infer_schema` does not understand them and raises `ValueError: Parameter input has unsupported type torch.Tensor`. That import error prevents the gRPC server from starting, so all 5 rerankers tests fail with `Connection refused` against `127.0.0.1:50051`.
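A minimal sketch of the failure mode (the namespace, function body, and signature here are illustrative stand-ins, not the actual transformers code):

```python
import torch

# Simplified stand-in for transformers' _grouped_mm_fallback; the string-typed
# annotations are the relevant part, not the math.
def _grouped_mm_fallback(input: "torch.Tensor", weight: "torch.Tensor") -> "torch.Tensor":
    return input @ weight

# Mirrors the transformers::grouped_mm_fallback registration. On torch 2.4.1
# infer_schema rejects the string annotations at registration (import) time:
#   ValueError: Parameter input has unsupported type torch.Tensor
# On torch 2.7.1 (the new pin) the registration goes through.
torch.library.custom_op(
    "demo::grouped_mm_fallback", _grouped_mm_fallback, mutates_args=()
)
```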
Fix: bump `torch==2.4.1` → `torch==2.7.1`. Matches the pin used by the `transformers` backend in this repo and is the most common torch version across our Python backends. Leaves `transformers` unpinned so we keep tracking upstream.

vllm-omni build fails on aarch64
vllm-omni's `setup.py` resolves dependencies dynamically and loads `requirements/cuda.txt` on cuda hosts, which pins `fa3-fwd==0.0.3`. fa3-fwd ships only `manylinux_2_24_x86_64` wheels and has no source distribution, so on aarch64 (e.g. l4t13 / SBSA cu130) uv fails with:

```
Because fa3-fwd==0.0.3 has no wheels with a matching platform tag (e.g.,
`manylinux_2_39_aarch64`) and vllm-omni==... depends on fa3-fwd==0.0.3,
we can conclude that vllm-omni cannot be used.
```
Fix: in `install.sh`, after cloning vllm-omni, strip `fa3-fwd` from `requirements/cuda.txt` when building on aarch64 (see the sketch below). This is safe because fa3-fwd is a soft runtime dep:

- `vllm_omni/diffusion/attention/backends/utils/fa.py` wraps `from fa3_fwd_interface import ...` in `try/except ImportError` and falls back through FA3 source build → FA2 (flash_attn) → vLLM's wrapper.
- `vllm_omni/diffusion/attention/backends/flash_attn.py` only raises if every FA backend is missing, with a message offering SDPA as a final fallback.
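The actual change is a one-line edit inside `install.sh`; this Python sketch only illustrates the equivalent filtering logic (the checkout path and arch check are assumptions, not the literal script):

```python
import platform
from pathlib import Path

# Hypothetical path to the cloned vllm-omni checkout; install.sh edits
# requirements/cuda.txt in place with a shell one-liner instead.
req = Path("vllm-omni/requirements/cuda.txt")

if platform.machine() == "aarch64":
    kept = [
        line
        for line in req.read_text().splitlines()
        # drop the x86_64-only pin; vllm-omni falls back to FA2/SDPA without it
        if not line.strip().startswith("fa3-fwd")
    ]
    req.write_text("\n".join(kept) + "\n")
```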
Test plan

- `tests-apple` / `tests-linux` for the rerankers backend
- `pip install -e .` step for vllm-omni without resolver failure

Assisted-by: Claude: claude-opus-4-7-1m [Claude Code]