
TGI hard crashes after 1 OOM error #1960

Closed · 2 of 4 tasks
pranavthombare opened this issue May 27, 2024 · 5 comments

@pranavthombare

System Info

TGI docker image on GCP.
GPU: A100
Model: Phi-3

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Load the Phi-3 model (pranavthombare/Phi-3-mini-4k-construct).
  2. Run the benchmark command: text-generation-benchmark -t pranavthombare/Phi-3-mini-4k-construct -s 512 (an HTTP-based alternative is sketched below).
  3. After it runs out of memory, run the same command again.
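
For reference, a minimal sketch of that HTTP-based alternative, assuming the router is serving the model on http://localhost:8080 (the URL, prompt size, and concurrency are guesses rather than values from this report, and may need tuning against the router's max_input_length / max_total_tokens limits before it actually exhausts GPU memory):

    # Sketch: push a running TGI instance toward a prefill OOM via the HTTP
    # /generate endpoint instead of text-generation-benchmark. All sizes here
    # are assumptions and may need tuning to reproduce the OOM.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    TGI_URL = "http://localhost:8080/generate"  # assumed router address

    def one_request(i: int) -> int:
        # A long prompt plus a large max_new_tokens inflates prefill and
        # KV-cache memory; concurrent requests let the router batch them.
        payload = {
            "inputs": "lorem ipsum " * 300,
            "parameters": {"max_new_tokens": 512},
        }
        resp = requests.post(TGI_URL, json=payload, timeout=600)
        return resp.status_code

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=64) as pool:
            for status in pool.map(one_request, range(64)):
                print(status)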

Expected behavior

The TGI launcher should not hard crash.

@pranavthombare
Author

Below is the error I'm getting:

    "timestamp": "2024-05-27T12:04:51.372064Z",
    "level": "ERROR",
    "fields": {
        "message": """'Shard complete standard error output:

        The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
        /opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
        warnings.warn(
        Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
        A new version of the following files was downloaded from https://huggingface.co/pranavthombare/Phi-3-mini-4k-construct:
        - configuration_phi3.py
        . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
        /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class \'text_generation_server.utils.dist.FakeGroup\'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
        warnings.warn(
        Exception ignored in: <function Server.__del__ at 0x7c5ff5530550>
        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
            cygrpc.schedule_coro_threadsafe(
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
            self._check_closed()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
            raise RuntimeError(\'Event loop is closed\')
        RuntimeError: Event loop is closed
        sys:1: RuntimeWarning: coroutine \'AioServer.shutdown\' was never awaited
        Task exception was never retrieved
        future: <Task finished name=\'Task-2218\' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
            return await response
        File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
            raise error
        File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
            return await behavior(request_or_iterator, context)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 144, in Prefill
            generations, next_batch, timings = self.model.generate_token(batch)
        File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
            return func(*args, **kwds)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 960, in generate_token
            raise e
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 957, in generate_token
            out, speculative_logits = self.forward(batch)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 900, in forward
            return self.model.forward(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in forward
            hidden_states = self.model(
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 340, in forward
            hidden_states, residual = layer(
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 279, in forward
            mlp_output = self.mlp(normed_attn_res_output)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 226, in forward
            return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
        torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
            return get_command(self)(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
            return self.main(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
            return _main(
        File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
            rv = self.invoke(ctx)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
            return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
            return ctx.invoke(self.callback, **ctx.params)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
            return __callback(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
            return callback(**use_params)  # type: ignore
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
            server.serve(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
            asyncio.run(
        File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
            return loop.run_until_complete(main)
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
            self.run_forever()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
            self._run_once()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
            handle._run()
        File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
            self._context.run(self._callback, *self._args)
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
        File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
            return await self.intercept(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
            exit(1)
        File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
            raise SystemExit(code)
        SystemExit: 1'"""
    },
    "target": "text_generation_launcher",
    "span": {"rank": 0, "name": "shard-manager"},
    "spans": [{"rank": 0, "name": "shard-manager"}],
}
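
The traceback shows why one OOM takes everything down: the torch.cuda.OutOfMemoryError raised inside flash_causal_lm.generate_token propagates to text_generation_server/interceptor.py, whose exception handler calls exit(1); the resulting SystemExit kills the shard process, and the launcher shuts down with it instead of failing only the offending request. A minimal sketch of a more forgiving handler, built on the same grpc_interceptor AsyncServerInterceptor API that appears in the traceback (the class name and recovery behaviour are illustrative, not the actual TGI code):

    # Hypothetical sketch (not the actual TGI interceptor): catch a CUDA OOM
    # raised by a single request, free the cache, and fail only that RPC
    # instead of calling exit(1) and taking the whole shard down.
    import grpc
    import torch
    from grpc_interceptor.server import AsyncServerInterceptor

    class GracefulOOMInterceptor(AsyncServerInterceptor):  # illustrative name
        async def intercept(self, method, request_or_iterator, context, method_name):
            try:
                response = method(request_or_iterator, context)
                return await response
            except torch.cuda.OutOfMemoryError:
                # Release cached blocks so later, smaller batches can still run,
                # then report the failure to the client instead of exiting.
                torch.cuda.empty_cache()
                await context.abort(
                    grpc.StatusCode.RESOURCE_EXHAUSTED, "CUDA out of memory"
                )

Whether emptying the CUDA cache is enough to leave the shard in a usable state after an OOM is an open question; the point is only that the exit(1) is what turns one failed request into a full crash.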

@pranavthombare
Author

I don't think it's a model-specific issue. I still need to reproduce it with other models, but this never used to happen before TGI 2.0.

@pranavthombare
Author

I am able to reproduce it with Mistral and Llama models.
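
For reference, a quick way to script that check (a sketch; the Mistral and Llama model IDs below are illustrative placeholders, not the exact repos used, and each run assumes TGI is already serving that model):

    # Sketch: re-run the same benchmark command across model families via
    # subprocess. Model IDs are illustrative, not the exact repos tested.
    import subprocess

    MODELS = [
        "pranavthombare/Phi-3-mini-4k-construct",  # from the original report
        "mistralai/Mistral-7B-Instruct-v0.2",      # illustrative Mistral repo
        "meta-llama/Llama-2-7b-chat-hf",           # illustrative Llama repo
    ]

    for model in MODELS:
        # Run until the first OOM, then run once more to observe the hard crash.
        for attempt in (1, 2):
            print(f"--- {model}, attempt {attempt} ---")
            subprocess.run(
                ["text-generation-benchmark", "-t", model, "-s", "512"],
                check=False,  # the second run is expected to fail
            )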


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jul 13, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 18, 2024