Problem Description
docker run --gpus all --rm --name qwen36-35 -p 8080:8000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" -e VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm-nightly-transformers-main --model Intel/Qwen3.6-35B-A3B-int4-AutoRound --served-model-name "qwen/qwen36-35b" --trust-remote-code --api-key mumu-102495153 --max-model-len 192382 --max-num-seqs 4 --gpu-memory-utilization 0.98 --enable-auto-tool-choice --tool-call-parser qwen3_coder --kv-cache-dtype fp8 --reasoning-parser qwen3 --max-num-batched-tokens 8192 --enable-prefix-caching
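For reference, a minimal request against the server started by the command above can be sketched as follows. The served model name, API key, and host port are taken from the docker run flags; the prompt itself is a placeholder, not a known trigger for the loop.

```python
import json

# Minimal chat-completions payload for the vLLM server launched above.
# Model name and API key match the docker run flags; the prompt is a placeholder.
payload = {
    "model": "qwen/qwen36-35b",
    "messages": [
        {"role": "user", "content": "Hello, can you introduce yourself?"}
    ],
    "max_tokens": 256,
}

headers = {
    "Authorization": "Bearer mumu-102495153",
    "Content-Type": "application/json",
}

# POST this body to http://localhost:8080/v1/chat/completions
# (host port 8080 maps to 8000 inside the container).
body = json.dumps(payload)
```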
lianglv
Intel org
2 days ago
We don't have your docker image. Could you provide minimal steps to reproduce the infinite loop issue?
wenhuach
Intel org
2 days ago
Does this issue occur for all prompts, or only for specific ones? We would appreciate it if you could share some example prompts that reproduce the issue.
pathosethoslogos
2 days ago
edited 2 days ago
I can confirm that this is indeed the case.
You can tell from the model's high download count and its low number of likes.
zsmweb
about 21 hours ago
I run the model on an RTX 3090 with 24GB. Not every conversation gets stuck in a loop; I use CherryStudio to check the weather.
Here is my Dockerfile.
cat Dockerfile
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PIP_NO_CACHE_DIR=1
ENV PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y \
    python3 python3-pip git \
    && rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --upgrade pip setuptools wheel
RUN python3 -m pip install -U \
    vllm --pre \
    --index-url https://pypi.org/simple \
    --extra-index-url https://wheels.vllm.ai/nightly
RUN python3 -m pip install -U \
    git+https://github.com/huggingface/transformers.git
RUN python3 -m pip install conch-triton-kernels
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
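Presumably the image referenced by the docker run command at the top is built from this Dockerfile; the tag below matches the image name used there:

```shell
# Build the image under the tag the docker run command expects
docker build -t vllm-nightly-transformers-main .
```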
Reproduction Steps
~
Environment Information
No response
Error Logs
Additional Context
No response