
Llama 70B chat GPTQ fails with Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens")) #755

Closed
vempaliakhil96 opened this issue Aug 1, 2023 · 9 comments

Comments

@vempaliakhil96

System Info

System:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   29C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2023-08-01T16:01:44.732944Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 727, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 825, in generate_token
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 813, in generate_token
    out = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 789, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 475, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 434, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 378, in forward
    mlp_output = self.mlp(normed_attn_res_output)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 322, in forward
    gate_up_states = self.gate_up_proj(hidden_states)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 186, in forward
    return self.linear.forward(x)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 349, in forward
    out = QuantLinearFunction.apply(
  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 244, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 209, in matmul248
    output = torch.empty(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 60, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 729, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-08-01T16:01:44.733482Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-08-01T16:01:44.774473Z ERROR text_generation_launcher: Webserver Crashed
2023-08-01T16:01:44.774514Z  INFO text_generation_launcher: Shutting down shards
2023-08-01T16:01:45.363546Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
2023-08-01T16:01:45.428896Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=3
2023-08-01T16:01:45.540127Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
2023-08-01T16:01:45.540387Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=2

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=TheBloke/Llama-2-70B-chat-GPTQ
num_shard=4
docker run --rm --gpus all --shm-size 4g -p 8080:80 --name $container_name \
  --log-driver=local --log-opt max-size=10m --log-opt max-file=3 \
  -v $volume:/data \
  --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  --env GPTQ_BITS=4 --env GPTQ_GROUPSIZE=1 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $model --num-shard $num_shard --quantize "gptq" \
  --max-best-of 1 --trust-remote-code > serving.log &

Expected behavior

I am unsure why I am getting a memory issue when the GPU is only 70% filled before warmup.
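
For a rough sense of why warmup needs headroom beyond the already-loaded weights, here is a back-of-the-envelope Python sketch of the fp16 KV cache that a full 4096-token prefill allocates, assuming Llama-2-70B's published configuration of 80 layers, 8 KV heads (GQA), and head dimension 128; prefill activations come on top of this, so the real peak is higher:

# Back-of-the-envelope KV-cache estimate for a 4096-token prefill.
# Assumes Llama-2-70B config (80 layers, 8 KV heads via GQA, head dim 128)
# and fp16 cache entries; activations during prefill add further memory.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                    # fp16
prefill_tokens = 4096
num_shards = 4

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_kv = kv_bytes_per_token * prefill_tokens

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")            # ~320 KiB
print(f"KV cache for {prefill_tokens} tokens: {total_kv / 2**30:.2f} GiB")   # ~1.25 GiB
print(f"Per shard across {num_shards} GPUs: {total_kv / num_shards / 2**30:.2f} GiB")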

@maxibove13

Exactly the same issue here; tried it with 7B and 7B quantized.

@Narsil
Collaborator

Narsil commented Aug 3, 2023

I think the error is rather clear:

not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

Basically, warmup attempts to run the biggest possible request at start time to detect whether your settings would actually cause an OOM.
If they would, we crash early instead of crashing randomly during runtime.

Thanks to this, your deployments should never OOM at runtime.
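
For illustration, here is a minimal Python sketch of that startup check, not TGI's actual implementation; `build_max_batch` is a hypothetical helper that constructs the largest batch the launcher settings allow:

# Minimal sketch of the warmup check described above (not TGI's real code).
import torch

def warmup(model, build_max_batch, max_prefill_tokens: int) -> None:
    batch = build_max_batch(max_prefill_tokens)   # worst-case prefill batch
    try:
        model.generate_token(batch)               # run one full prefill pass
    except Exception as exc:
        # Any failure here is surfaced as the memory message below, which is
        # why an unrelated CUDA kernel error (as in the tracebacks in this
        # thread) can show up under it.
        raise RuntimeError(
            f"Not enough memory to handle {max_prefill_tokens} prefill tokens. "
            "You need to decrease `--max-batch-prefill-tokens`"
        ) from exc
    finally:
        torch.cuda.empty_cache()                  # release warmup allocations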

Closing this issue, feel free to comment if that is not correct.

Narsil closed this as completed Aug 3, 2023
@arun2728

I am still facing the issue even after reducing it to 1024 for Vicuna 7b.

RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-08-28T11:47:13.809881Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-08-28T11:47:13.832819Z ERROR text_generation_launcher: Webserver Crashed
2023-08-28T11:47:13.832850Z  INFO text_generation_launcher: Shutting down shards
2023-08-28T11:47:14.423784Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
2023-08-28T11:47:14.558445Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1

GPU Stats before warmup

[Screenshot: nvidia-smi output showing GPU usage before warmup]

Docker Command:

docker run --gpus all --name tgi-vicuna-7b-v1.5-gptq --shm-size 4g -p 7002:80 \
  -v /models/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/vicuna-7B-v1.5-GPTQ --num-shard 2 --quantize gptq \
  --max-batch-prefill-tokens 1024

System Information:

  • GPU: 2* 16GB T4
  • 4-Core CPU
  • RAM: 45GB

I even tried setting --max-best-of 1 and --env GPTQ_BITS=4 --env GPTQ_GROUPSIZE=1, but I am still facing the same issue.

@Narsil
Collaborator

Narsil commented Aug 28, 2023

Can you show the entire stacktrace? Sometimes it's an entirely different issue that gets confused with an OOM error.

@arun2728

2023-08-28T12:11:04.665298Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 753, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 851, in generate_token
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 839, in generate_token
    out = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 815, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 480, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 439, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 383, in forward
    mlp_output = self.mlp(normed_attn_res_output)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 329, in forward
    return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/activations.py", line 150, in forward
    return nn.functional.silu(input)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2059, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 66, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 755, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-08-28T12:11:04.958211Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 753, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 851, in generate_token
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 839, in generate_token
    out = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 815, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 480, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 439, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 383, in forward
    mlp_output = self.mlp(normed_attn_res_output)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 329, in forward
    return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/activations.py", line 150, in forward
    return nn.functional.silu(input)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2059, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 66, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 755, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

@Narsil
Collaborator

Narsil commented Aug 28, 2023

File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2059, in silu
return torch._C._nn.silu(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Hmm, that seems like the issue: no kernel image is available for execution on the device. T4s are supported; we use them regularly. Any possibility you have CUDA < 11.8?
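
A quick way to check from inside the container (a diagnostic sketch using plain PyTorch, which the tracebacks show is installed under /opt/conda):

# Print the CUDA runtime torch was built against and each GPU's compute
# capability. "no kernel image is available for execution on the device"
# typically means the shipped kernels were not built for this GPU architecture
# or CUDA runtime (a Tesla T4 is compute capability 7.5).
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")

Keep in mind that nvidia-smi reports the highest CUDA version the driver supports, not the CUDA runtime inside the container, so the two can disagree.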

Can you please paste everything asked for here: https://github.com/huggingface/text-generation-inference/issues/new/choose ?

@arun2728

I don't think that's the case, because nvidia-smi shows the CUDA version as 12.

[Screenshot: nvidia-smi output showing the reported CUDA version]

P.S.: I am running this on a GCP instance, machine type: custom-4-49152-ext.

@artemdinaburg

I am encountering the exact same error, also on GCP with two Tesla T4s, but with the Phind-CodeLlama-34B-v1-GPTQ model.

@arun2728

Can you please paste everything asked for here: https://github.com/huggingface/text-generation-inference/issues/new/choose ?

@Narsil I have created an issue here: #940
