2024-06-17T11:01:59.291750Z INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-06-17T11:01:59.291802Z INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
2024-06-17T11:01:59.336401Z INFO text_generation_benchmark: benchmark/src/main.rs:161: Tokenizer loaded
2024-06-17T11:01:59.365280Z INFO text_generation_benchmark: benchmark/src/main.rs:170: Connect to model server
2024-06-17T11:01:59.368575Z INFO text_generation_benchmark: benchmark/src/main.rs:179: Connected
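For context, the benchmark log above was presumably produced by an invocation along these lines; this is a sketch, where the local tokenizer path is an assumption based on "Found local tokenizer" and the router log below, and the length flags come from the report further down:

```bash
# Hypothetical reconstruction of the benchmark run, executed inside the
# running TGI container; only --sequence-length and --decode-length values
# are taken from the report, the tokenizer path is an assumption.
text-generation-benchmark \
  --tokenizer-name /home/zhuh/7b-chat-hf \
  --sequence-length 512 \
  --decode-length 128
```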
2024-06-17T10:54:34.661975Z INFO text_generation_launcher: Convert: [1/2] -- Took: 0:00:23.355863
2024-06-17T10:54:42.624075Z INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:07.961668
2024-06-17T10:54:43.550339Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-17T10:54:43.550676Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-17T10:54:46.861699Z INFO text_generation_launcher: Detected system rocm
2024-06-17T10:54:46.929654Z INFO text_generation_launcher: ROCm: using Flash Attention 2 Composable Kernel implementation.
2024-06-17T10:54:47.181972Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-06-17T10:54:53.564579Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-17T10:54:58.632695Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-17T10:54:58.670817Z INFO shard-manager: text_generation_launcher: Shard ready in 15.119042733s rank=0
2024-06-17T10:54:58.766242Z INFO text_generation_launcher: Starting Webserver
2024-06-17T10:54:58.849177Z INFO text_generation_router: router/src/main.rs:302: Using config Some(Llama)
2024-06-17T10:54:58.849209Z WARN text_generation_router: router/src/main.rs:311: no pipeline tag found for model /home/zhuh/7b-chat-hf
2024-06-17T10:54:58.849213Z WARN text_generation_router: router/src/main.rs:329: Invalid hostname, defaulting to 0.0.0.0
2024-06-17T10:54:58.853566Z INFO text_generation_router::server: router/src/server.rs:1552: Warming up model
2024-06-17T10:54:59.601144Z INFO text_generation_launcher: PyTorch TunableOp (https://github.com/fxmarty/pytorch/tree/2.3-patched/aten/src/ATen/cuda/tunable) is enabled. The warmup may take several minutes, picking the ROCm optimal matrix multiplication kernel for the target lengths 1, 2, 4, 8, 16, 32, with typical 5-8% latency improvement for small sequence lengths. The picked GEMMs are saved in the file /data/tunableop_-home-zhuh-7b-chat-hf_tp1_rank0.csv. To disable TunableOp, please launch TGI with `PYTORCH_TUNABLEOP_ENABLED=0`.
2024-06-17T10:54:59.601247Z INFO text_generation_launcher: Warming up TunableOp for seqlen=1
2024-06-17T10:55:46.295162Z INFO text_generation_launcher: Warming up TunableOp for seqlen=2
2024-06-17T10:56:18.910991Z INFO text_generation_launcher: Warming up TunableOp for seqlen=4
2024-06-17T10:56:51.715308Z INFO text_generation_launcher: Warming up TunableOp for seqlen=8
2024-06-17T10:57:24.784412Z INFO text_generation_launcher: Warming up TunableOp for seqlen=16
2024-06-17T10:57:59.430531Z INFO text_generation_launcher: Warming up TunableOp for seqlen=32
2024-06-17T10:58:29.335915Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-17T10:58:30.344828Z INFO text_generation_router::server: router/src/server.rs:1579: Using scheduler V3
2024-06-17T10:58:30.344853Z INFO text_generation_router::server: router/src/server.rs:1631: Setting max batch total tokens to 346576
2024-06-17T10:58:30.360395Z INFO text_generation_router::server: router/src/server.rs:1868: Connected
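Once the router logs Connected, a direct request can show whether generation responds at all, independently of the benchmark. A minimal check, assuming the usual port mapping of host 8080 to the container's port 80:

```bash
# Hypothetical sanity check against the router's /generate endpoint;
# the port is an assumption from the usual TGI docker run mapping (-p 8080:80).
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```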
System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: 96b7b40
Docker label: sha-96b7b40-rocm
Information
Tasks
Reproduction
I followed the steps from https://github.com/huggingface/hf-rocm-benchmark
docker exec -it tgi_container_name /bin/bash
and it got stuck after the benchmark log shown at the top of this issue.
I also tried Llama-2-7B on a single GPU card with a sequence length of 512 and a decode length of 128, but it got stuck too.
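For completeness, the server container matching the Docker label above was presumably started along the lines of the standard TGI ROCm instructions; this is a sketch, not the exact command used, with the volume mounts inferred from the paths in the logs:

```bash
# A sketch under assumptions: device flags follow the standard TGI ROCm
# instructions; the image tag comes from the Docker label above and the
# model path from the router log ("/home/zhuh/7b-chat-hf").
docker run --rm -it --ipc=host --shm-size 1g \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8080:80 -v /data:/data -v /home/zhuh:/home/zhuh \
  ghcr.io/huggingface/text-generation-inference:sha-96b7b40-rocm \
  --model-id /home/zhuh/7b-chat-hf
# Per the warmup log, TunableOp can be ruled out as the cause of the hang
# by additionally passing: -e PYTORCH_TUNABLEOP_ENABLED=0
```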
Expected behavior
Prefill and decode latency numbers are expected, but the benchmark gets stuck and outputs nothing for nearly an hour.
In addition, GPU utilization is zero, whereas it was non-zero during the warmup steps.
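GPU utilization during the hang can be confirmed with the standard ROCm monitoring tool:

```bash
# Watch GPU utilization while the request hangs; it staying at 0%
# matches the behavior described above.
watch -n 1 rocm-smi
```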