GooseLLM — vLLM for NVIDIA V100 (SM70)

High-throughput LLM inference on Tesla V100 GPUs with custom FlashAttention-2 kernel from ai-bond.

Acknowledgements

Special thanks to 1CatAI for their amazing V100 builds! Check out their repository for production-ready vLLM optimizations.

Quick Start

Docker Build

docker build \
  -f docker/Dockerfile.sm70-build \
  -t goosellm:sm70 \
  .

Local Build

# 1. Create and activate environment
conda create -n goosellm python=3.12 -y
conda activate goosellm

# 2. Install dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu128
python -m pip install -r requirements/cuda.txt
python -m pip install 'setuptools>=77.0.3,<81.0.0' 'setuptools_scm>=8' grpcio-tools cmake build

# 3. Set build environment
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}
export VLLM_TARGET_DEVICE=cuda
export VLLM_MAIN_CUDA_VERSION=12.8
export TORCH_CUDA_ARCH_LIST=7.0
export MAX_JOBS=$(nproc)
export NVCC_THREADS=4

# 4. Build SM70 kernel
cd csrc/flash_attention_v100
sed -i 's/if not torch.cuda.is_available():/if False: # if not torch.cuda.is_available():/' setup.py
python setup.py build_ext --inplace
cd ../..

# 5. Build vLLM wheel (matches 1Cat's original process)
rm -rf build vllm.egg-info .deps/*-build .deps/*-subbuild
SETUPTOOLS_SCM_PRETEND_VERSION=0.0.3.dev0 \
  python -m build --wheel --no-isolation --outdir dist-cu128-sm70

# 6. Install
python -m pip install dist-cu128-sm70/*.whl --no-deps

Example Serving Commands

Before running any of the commands below, make sure you're in the conda environment:

conda activate goosellm

MoE Model (Qwen3.6-35B-A3B-AWQ)

python -m vllm.entrypoints.openai.api_server \
  --model QuantTrio/Qwen3.6-35B-A3B-AWQ \
  --tensor-parallel-size 4 \
  --dtype float16 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 262144 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --attention-backend FLASH_ATTN_V100 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1]}' \
  --host 0.0.0.0 \
  --port 8000

Dense Model (Qwen3.6-27B-AWQ)

python -m vllm.entrypoints.openai.api_server \
  --model QuantTrio/Qwen3.6-27B-AWQ \
  --tensor-parallel-size 4 \
  --dtype float16 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 262144 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --attention-backend FLASH_ATTN_V100 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1]}' \
  --host 0.0.0.0 \
  --port 8000

Docker MoE (Qwen3.6-35B-A3B-AWQ/ 122B)

docker run --rm \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_P2P_LEVEL=NVL \
  goosellm:sm70 \
  python -m vllm.entrypoints.openai.api_server \
    --model QuantTrio/Qwen3.6-35B-A3B-AWQ \
    --quantization awq \
    --dtype float16 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --tensor-parallel-size 4 \
    --max-num-seqs 1 \
    --max-num-batched-tokens 16384 \
    --skip-mm-profiling \
    --attention-backend FLASH_ATTN_V100 \
    --limit-mm-per-prompt '{"image":0,"video":0}' \
    --compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1]}'

Docker Dense (Qwen3.6-27B-AWQ)

docker run --rm \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_P2P_LEVEL=NVL \
  goosellm:sm70 \
  python -m vllm.entrypoints.openai.api_server \
    --model QuantTrio/Qwen3.6-27B-AWQ \
    --quantization awq \
    --dtype float16 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 262144 \
    --tensor-parallel-size 4 \
    --max-num-seqs 1 \
    --max-num-batched-tokens 16384 \
    --skip-mm-profiling \
    --attention-backend FLASH_ATTN_V100 \
    --limit-mm-per-prompt '{"image":0,"video":0}' \
    --compilation-config '{"cudagraph_mode":"full_and_piecewise","cudagraph_capture_sizes":[1]}' \
    --host 0.0.0.0 \
    --port 8000

Results

References

Original V100 kernel research: ai-bond/flash-attention-v100
Upstream vLLM: 1CatAI/1Cat-vLLM

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.buildkite		.buildkite
.gemini		.gemini
.github		.github
benchmarks		benchmarks
cmake		cmake
csrc		csrc
docker		docker
docs		docs
examples		examples
lmdeploy		lmdeploy
requirements		requirements
research/spark-m8n8k4		research/spark-m8n8k4
tests		tests
tools		tools
vllm		vllm
vllm_src_backup/v1/attention/backends		vllm_src_backup/v1/attention/backends
测试结果		测试结果
.clang-format		.clang-format
.coveragerc		.coveragerc
.dockerignore		.dockerignore
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitignore		.gitignore
.markdownlint.yaml		.markdownlint.yaml
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
.shellcheckrc		.shellcheckrc
.yapfignore		.yapfignore
AGENTS.md		AGENTS.md
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DCO		DCO
FA2_INTEGRATION.md		FA2_INTEGRATION.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
OPEN_SOURCE_SM70_GUIDE.md		OPEN_SOURCE_SM70_GUIDE.md
README.md		README.md
RELEASE.md		RELEASE.md
SECURITY.md		SECURITY.md
codecov.yml		codecov.yml
mkdocs.yaml		mkdocs.yaml
pyproject.toml		pyproject.toml
setup.py		setup.py
use_existing_torch.py		use_existing_torch.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GooseLLM — vLLM for NVIDIA V100 (SM70)

Acknowledgements

Quick Start

Docker Build

Local Build

Example Serving Commands

MoE Model (Qwen3.6-35B-A3B-AWQ)

Dense Model (Qwen3.6-27B-AWQ)

Docker MoE (Qwen3.6-35B-A3B-AWQ/ 122B)

Docker Dense (Qwen3.6-27B-AWQ)

Results

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GooseLLM — vLLM for NVIDIA V100 (SM70)

Acknowledgements

Quick Start

Docker Build

Local Build

Example Serving Commands

MoE Model (Qwen3.6-35B-A3B-AWQ)

Dense Model (Qwen3.6-27B-AWQ)

Docker MoE (Qwen3.6-35B-A3B-AWQ/ 122B)

Docker Dense (Qwen3.6-27B-AWQ)

Results

References

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages