
Vicuna-7B-v1.5-GPTQ fails with Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens")) #940

Closed
arun2728 opened this issue Aug 29, 2023 · 8 comments


arun2728 commented Aug 29, 2023

System Info

  • The full command line used that caused the issue:
docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 8080:80 -v /models/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq  --max-batch-prefill-tokens 1024
  • Model being used: TheBloke/vicuna-7B-v1.5-GPTQ
  • OS version: Ubuntu 22.04.2 LTS
  • Text Generation Inference Version: v1.0.2 (latest)
  • Hardware:
    • Cloud: GCP
    • GPUs: 2 * T4
    • CUDA Version: 12.0
    • Num of CPUs: 4
    • VRAM: 47GB

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the docker command to reproduce:

docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 8080:80 -v /models/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq  --max-batch-prefill-tokens 1024

Warnings:

2023-08-29T04:52:38.135870Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-08-29T04:52:38.135873Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

Error Stacktrace:

2023-08-28T12:11:04.665298Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 753, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 851, in generate_token
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 839, in generate_token
    out = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 815, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 480, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 439, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 383, in forward
    mlp_output = self.mlp(normed_attn_res_output)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 329, in forward
    return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/activations.py", line 150, in forward
    return nn.functional.silu(input)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2059, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 66, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 755, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-08-28T12:11:04.958211Z ERROR text_generation_launcher: Method Warmup encountered an error.
(An identical traceback, ending in the same RuntimeError, is reported a second time.)

2023-08-28T11:47:13.809881Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-08-28T11:47:13.832819Z ERROR text_generation_launcher: Webserver Crashed
2023-08-28T11:47:13.832850Z  INFO text_generation_launcher: Shutting down shards
2023-08-28T11:47:14.423784Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
2023-08-28T11:47:14.558445Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1

Expected behavior

The model is expected to load and warm up without error, but it still throws an OOM even though only about 22% of GPU memory is occupied by the model before warmup. See the screenshot below.

[screenshot: GPU memory usage before warmup]
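For reference, per-GPU memory usage can be checked on the host with nvidia-smi (a generic check added for illustration, not part of the original report):

# prints used and total memory for each GPU
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv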

arun2728 (Author) commented:

Earlier I got the error:

RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

So I tried reducing 4096 to 1024, but I still get the same error.

Narsil (Collaborator) commented Aug 29, 2023

It seems exllama fails to work on T4 (compute_cap 7.5).

You should be able to run with -e DISABLE_EXLLAMA=True in your command.
I'll update TGI so this gets handled automatically from now on.
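For example, a sketch of the reproduction command from above with exllama disabled (same image, model, and paths as in the issue; the flag comes from the suggestion above):

# disable the exllama GPTQ kernels, which are not supported on compute capability 7.5 (T4)
docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 8080:80 -v /models/data:/data -e DISABLE_EXLLAMA=True ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq --max-batch-prefill-tokens 1024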

ekarmazin commented:

@Narsil We are experiencing the exact same error on a T4 in GKE Autopilot. Setting -e DISABLE_EXLLAMA=True didn't work out. We got messages like:

Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True

but TGI still failed with:

RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens


AzureSilent commented Aug 31, 2023

It seems exllama fails to work on T4 (compute_cap 7.5).

You should be able to run with -e DISABLE_EXLLAMA=True in your command. I'll update TGI so this gets handled automatically from now on.

This works on T4: TGI no longer reports

ERROR warmup{max_input_length=256 max_prefill_tokens=256}:  Not enough memory to handle  256 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

but it does say:
Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True

BUT, the inference speed is very, very slow.
Actually, the 7B model can be deployed directly on two T4s (with the default max_input_length) without quantization, and it runs about 8x faster than this setup. That doesn't sound like GPTQ's expected performance, does it?
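For comparison, a sketch of deploying the unquantized model on the same two T4s (the base checkpoint name lmsys/vicuna-7b-v1.5 and the paths are assumptions; adjust to your setup):

# unquantized fp16 deployment; note there is no --quantize flag here
docker run --gpus all --shm-size 4g -p 8080:80 -v /models/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id lmsys/vicuna-7b-v1.5 --num-shard 2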

Narsil (Collaborator) commented Aug 31, 2023

@AzureSilent Without exllama it's using the Triton kernel, which relies on JIT compilation.
Performance needs to be measured after a warmup on the given test sizes (for instance, run the benchmark twice).
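A sketch of one way to do that with the text-generation-benchmark tool bundled in the TGI image (the exact invocation is an assumption; see text-generation-benchmark --help):

# run inside the already-running container; repeat the run so the second pass reflects post-JIT-warmup speed
docker exec -it tgi-vicuna-7b-v1.5-gptq text-generation-benchmark --tokenizer-name TheBloke/vicuna-7B-v1.5-GPTQ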

Same in production, it will start being slow before getting back to more acceptable speeds.

However, I'm not sure how well it compiles on T4; it's definitely possible that it's slower, unfortunately (although 8x seems like a lot).


AzureSilent commented Sep 1, 2023

@Narsil
Thanks for your reply.
Yes, it indeed becomes faster after some warmup.
I've tried deploying llama2-13B-GPTQ int4 on two GPUs.
I found that neither the 2080 Ti nor the T4 can run with exllama, but the 2080 Ti is almost 2.5x faster than the T4.
However, when they run llama2-7b-fp16, there is no big difference in speed.

On the 2080 Ti, llama2-13B-GPTQ int4 is approximately 40% slower than llama2-13B-fp16.


jrsperry commented Oct 2, 2023

I also have an issue running pretty much any GPTQ model: I can't run TheBloke/Llama-2-7b-Chat-GPTQ (it results in the same You need to decrease --max-batch-prefill-tokens error, although with a slightly different stack trace), but I can run the base Meta model (meta-llama/Llama-2-7b-chat-hf) on the same NVIDIA L4 hardware.

Disabling exllama didn't help either.

github-actions bot commented:

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Mar 28, 2024
github-actions bot closed this as not planned on Apr 2, 2024