
Vicuna-7B-v1.5-GPTQ fails with Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens")) #940

Closed
arun2728 opened this issue Aug 29, 2023 · 8 comments


arun2728 commented Aug 29, 2023

System Info

  • The full command line used that caused the issue:
docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 8080:80 -v /models/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq  --max-batch-prefill-tokens 1024
  • Model being used: TheBloke/vicuna-7B-v1.5-GPTQ
  • OS version: Ubuntu 22.04.2 LTS
  • Text Generation Inference Version: v1.0.2 (latest)
  • Hardware:
    • Cloud: GCP
    • GPUs: 2 * T4
    • CUDA Version: 12.0
    • Num of CPUs: 4
    • VRAM: 47GB

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the docker command to reproduce:

docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 8080:80 -v /models/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq  --max-batch-prefill-tokens 1024

Warnings:

2023-08-29T04:52:38.135870Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-08-29T04:52:38.135873Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

Error Stacktrace:

2023-08-28T12:11:04.665298Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 753, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 851, in generate_token
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 839, in generate_token
    out = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 815, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 480, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 439, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 383, in forward
    mlp_output = self.mlp(normed_attn_res_output)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 329, in forward
    return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/activations.py", line 150, in forward
    return nn.functional.silu(input)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2059, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 66, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 755, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-08-28T12:11:04.958211Z ERROR text_generation_launcher: Method Warmup encountered an error.
(An identical traceback, ending in the same RuntimeError, is reported a second time.)

2023-08-28T11:47:13.809881Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-08-28T11:47:13.832819Z ERROR text_generation_launcher: Webserver Crashed
2023-08-28T11:47:13.832850Z  INFO text_generation_launcher: Shutting down shards
2023-08-28T11:47:14.423784Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
2023-08-28T11:47:14.558445Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1

Expected behavior

The model is expected to load and warm up without error, but it still throws an OOM even though only about 22% of GPU memory is occupied by the model before warmup. See the screenshot below.

[screenshot: GPU memory usage before warmup]
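For reference, per-GPU memory usage can be checked on the host with nvidia-smi (a generic check added for illustration, not part of the original report):

# prints used and total memory for each GPU
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv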

arun2728 (Author) commented:

Earlier I got the error:

RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

So I tried reducing 4096 to 1024, but I still get the same error.

Narsil (Collaborator) commented Aug 29, 2023

It seems exllama fails to work on T4 (compute_cap 7.5).

You should be able to run with -e DISABLE_EXLLAMA=True in your command.
I'll update TGI so this gets handled automatically from now on.
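For example, a sketch of the reproduction command from above with exllama disabled (same image, model, and paths as in the issue; the flag comes from the suggestion above):

# disable the exllama GPTQ kernels, which are not supported on compute capability 7.5 (T4)
docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 8080:80 -v /models/data:/data -e DISABLE_EXLLAMA=True ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq --max-batch-prefill-tokens 1024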

ekarmazin commented:

@Narsil We are experiencing the exact same error on a T4 in GKE Autopilot. Setting -e DISABLE_EXLLAMA=True didn't work out. We got messages like:

Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True

but TGI still failed with:

RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens


AzureSilent commented Aug 31, 2023

It seems exllama fails to work on T4 (compute_cap 7.5).

You should be able to run with -e DISABLE_EXLLAMA=True in your command. I'll update TGI so this gets handled automatically from now on.

This works on T4: TGI no longer reports

ERROR warmup{max_input_length=256 max_prefill_tokens=256}:  Not enough memory to handle  256 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

but it does say:
Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True

BUT, the inference speed is very, very slow.
Actually, the 7B model can be deployed directly on two T4s (with the default max_input_length) without quantization, and it runs about 8x faster than this setup. That doesn't sound like GPTQ's expected performance, does it?
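For comparison, a sketch of deploying the unquantized model on the same two T4s (the base checkpoint name lmsys/vicuna-7b-v1.5 and the paths are assumptions; adjust to your setup):

# unquantized fp16 deployment; note there is no --quantize flag here
docker run --gpus all --shm-size 4g -p 8080:80 -v /models/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id lmsys/vicuna-7b-v1.5 --num-shard 2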

Narsil (Collaborator) commented Aug 31, 2023

@AzureSilent Without exllama it's using the Triton kernel, which relies on JIT compilation.
Performance needs to be measured after a warmup on the given test sizes (for instance, run the benchmark twice).
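A sketch of one way to do that with the text-generation-benchmark tool bundled in the TGI image (the exact invocation is an assumption; see text-generation-benchmark --help):

# run inside the already-running container; repeat the run so the second pass reflects post-JIT-warmup speed
docker exec -it tgi-vicuna-7b-v1.5-gptq text-generation-benchmark --tokenizer-name TheBloke/vicuna-7B-v1.5-GPTQ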

Same in production, it will start being slow before getting back to more acceptable speeds.

However, I'm not sure how well it compiles on T4; it's definitely possible that it's slower, unfortunately (although 8x seems like a lot).


AzureSilent commented Sep 1, 2023

@Narsil
Thanks for your reply.
Yes, it indeed becomes faster after some warmup.
I've tried deploying llama2-13B-GPTQ int4 on two GPUs.
I found that neither the 2080 Ti nor the T4 can run with exllama, but the 2080 Ti is almost 2.5x faster than the T4.
However, when they run llama2-7b-fp16, there is no big difference in speed.

On the 2080 Ti, llama2-13B-GPTQ int4 is approximately 40% slower than llama2-13B-fp16.


jrsperry commented Oct 2, 2023

I also have an issue running pretty much any GPTQ model: I can't run TheBloke/Llama-2-7b-Chat-GPTQ (it results in the same You need to decrease --max-batch-prefill-tokens error, although with a slightly different stack trace), but I can run the base Meta model (meta-llama/Llama-2-7b-chat-hf) on the same NVIDIA L4 hardware.

Disabling exllama didn't help either.

github-actions bot commented:

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Mar 28, 2024
github-actions bot closed this as not planned on Apr 2, 2024