
Gets stuck when running text-generation-benchmark on AMD GPU #2077

Open
yuqie opened this issue Jun 17, 2024 · 2 comments

yuqie commented Jun 17, 2024

System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: 96b7b40
Docker label: sha-96b7b40-rocm

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I followed the steps from https://github.com/huggingface/hf-rocm-benchmark.

  1. Start the docker container; a local model is used and the server is set up successfully.
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
    --net host -v $(pwd)/hf_cache:/data -e HUGGING_FACE_HUB_TOKEN=$HF_READ_TOKEN \
    ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
    --model-id local_path/Meta-Llama-70B-Instruct --num-shard 8
  2. Open another shell: docker exec -it tgi_container_name /bin/bash
  3. Run the benchmark:
text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
    --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
    -b 1 -b 2 

and it got stuck after the following log:

2024-06-17T11:01:59.291750Z  INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-06-17T11:01:59.291802Z  INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
2024-06-17T11:01:59.336401Z  INFO text_generation_benchmark: benchmark/src/main.rs:161: Tokenizer loaded
2024-06-17T11:01:59.365280Z  INFO text_generation_benchmark: benchmark/src/main.rs:170: Connect to model server
2024-06-17T11:01:59.368575Z  INFO text_generation_benchmark: benchmark/src/main.rs:179: Connected

I also tried llama2-7b on a single GPU card with a sequence-length of 512 and decode-length of 128, but it got stuck too.

2024-06-17T10:54:34.661975Z  INFO text_generation_launcher: Convert: [1/2] -- Took: 0:00:23.355863
2024-06-17T10:54:42.624075Z  INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:07.961668
2024-06-17T10:54:43.550339Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-17T10:54:43.550676Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-17T10:54:46.861699Z  INFO text_generation_launcher: Detected system rocm
2024-06-17T10:54:46.929654Z  INFO text_generation_launcher: ROCm: using Flash Attention 2 Composable Kernel implementation.
2024-06-17T10:54:47.181972Z  WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-06-17T10:54:53.564579Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-17T10:54:58.632695Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-17T10:54:58.670817Z  INFO shard-manager: text_generation_launcher: Shard ready in 15.119042733s rank=0
2024-06-17T10:54:58.766242Z  INFO text_generation_launcher: Starting Webserver
2024-06-17T10:54:58.849177Z  INFO text_generation_router: router/src/main.rs:302: Using config Some(Llama)
2024-06-17T10:54:58.849209Z  WARN text_generation_router: router/src/main.rs:311: no pipeline tag found for model /home/zhuh/7b-chat-hf
2024-06-17T10:54:58.849213Z  WARN text_generation_router: router/src/main.rs:329: Invalid hostname, defaulting to 0.0.0.0
2024-06-17T10:54:58.853566Z  INFO text_generation_router::server: router/src/server.rs:1552: Warming up model
2024-06-17T10:54:59.601144Z  INFO text_generation_launcher: PyTorch TunableOp (https://github.com/fxmarty/pytorch/tree/2.3-patched/aten/src/ATen/cuda/tunable) is enabled. The warmup may take several minutes, picking the ROCm optimal matrix multiplication kernel for the target lengths 1, 2, 4, 8, 16, 32, with typical 5-8% latency improvement for small sequence lengths. The picked GEMMs are saved in the file /data/tunableop_-home-zhuh-7b-chat-hf_tp1_rank0.csv. To disable TunableOp, please launch TGI with `PYTORCH_TUNABLEOP_ENABLED=0`.
2024-06-17T10:54:59.601247Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=1
2024-06-17T10:55:46.295162Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=2
2024-06-17T10:56:18.910991Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=4
2024-06-17T10:56:51.715308Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=8
2024-06-17T10:57:24.784412Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=16
2024-06-17T10:57:59.430531Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=32
2024-06-17T10:58:29.335915Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-17T10:58:30.344828Z  INFO text_generation_router::server: router/src/server.rs:1579: Using scheduler V3
2024-06-17T10:58:30.344853Z  INFO text_generation_router::server: router/src/server.rs:1631: Setting max batch total tokens to 346576
2024-06-17T10:58:30.360395Z  INFO text_generation_router::server: router/src/server.rs:1868: Connected
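
(Side note: judging from the timestamps above, the TunableOp warmup takes roughly 30-50 seconds per sequence length. While debugging, it can be skipped with the environment variable the log itself points to; a rough sketch of the single-GPU launch with that variable added, reusing the flags from step 1 and the local model path from the log, would be something like:)

# PYTORCH_TUNABLEOP_ENABLED=0 disables the TunableOp GEMM tuning warmup (per the log message above)
# the local model path/mounts should match whatever was used in the original run
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
    --net host -v $(pwd)/hf_cache:/data -e PYTORCH_TUNABLEOP_ENABLED=0 \
    ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
    --model-id /home/zhuh/7b-chat-hf --num-shard 1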

Expected behavior

Prefill and decode latency results are expected, but the benchmark gets stuck and outputs nothing for nearly an hour.
Besides, GPU utilization is zero, whereas it was non-zero during the warmup steps.

@LysandreJik

Thanks for the report @yuqie!

cc @fxmarty as the author of the benchmark


fxmarty commented Jun 26, 2024

Hi @yuqie, thank you. What happens after launching

text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
    --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
    -b 1 -b 2 

in the second terminal within the container?

You should see a graphical benchmark view like https://youtu.be/jlMAX2Oaht0?t=198 at this point.
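
If nothing shows up, one way to check whether the server itself is still responsive (independently of the benchmark tool) is to send a request to the REST endpoint from another shell. A minimal sketch, assuming the router listens on the default port 80 given --net host (adjust if you passed --port):

curl http://localhost:80/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'

If that request also hangs, the problem is likely on the server side rather than in text-generation-benchmark.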
