
TGI hard crashes after 1 OOM error #1960

Closed · 2 of 4 tasks
pranavthombare opened this issue May 27, 2024 · 5 comments

@pranavthombare

System Info

TGI docker image on GCP.
GPU: A100
Model: Phi-3

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Load the Phi-3 model (pranavthombare/Phi-3-mini-4k-construct).
  2. Run the benchmark command: text-generation-benchmark -t pranavthombare/Phi-3-mini-4k-construct -s 512 (an HTTP-based alternative is sketched below).
  3. After it runs out of memory, run the same command again.
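
For reference, a minimal sketch of that HTTP-based alternative, assuming the router is serving the model on http://localhost:8080 (the URL, prompt size, and concurrency are guesses rather than values from this report, and may need tuning against the router's max_input_length / max_total_tokens limits before it actually exhausts GPU memory):

    # Sketch: push a running TGI instance toward a prefill OOM via the HTTP
    # /generate endpoint instead of text-generation-benchmark. All sizes here
    # are assumptions and may need tuning to reproduce the OOM.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    TGI_URL = "http://localhost:8080/generate"  # assumed router address

    def one_request(i: int) -> int:
        # A long prompt plus a large max_new_tokens inflates prefill and
        # KV-cache memory; concurrent requests let the router batch them.
        payload = {
            "inputs": "lorem ipsum " * 300,
            "parameters": {"max_new_tokens": 512},
        }
        resp = requests.post(TGI_URL, json=payload, timeout=600)
        return resp.status_code

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=64) as pool:
            for status in pool.map(one_request, range(64)):
                print(status)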

Expected behavior

The TGI launcher should not hard crash.

@pranavthombare
Author

Below is the error I'm getting:

    "timestamp": "2024-05-27T12:04:51.372064Z",
    "level": "ERROR",
    "fields": {
        "message": """'Shard complete standard error output:

        The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
        /opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
        warnings.warn(
        Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
        A new version of the following files was downloaded from https://huggingface.co/pranavthombare/Phi-3-mini-4k-construct:
        - configuration_phi3.py
        . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
        /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class \'text_generation_server.utils.dist.FakeGroup\'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
        warnings.warn(
        Exception ignored in: <function Server.__del__ at 0x7c5ff5530550>
        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
            cygrpc.schedule_coro_threadsafe(
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
            self._check_closed()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
            raise RuntimeError(\'Event loop is closed\')
        RuntimeError: Event loop is closed
        sys:1: RuntimeWarning: coroutine \'AioServer.shutdown\' was never awaited
        Task exception was never retrieved
        future: <Task finished name=\'Task-2218\' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
            return await response
        File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
            raise error
        File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
            return await behavior(request_or_iterator, context)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 144, in Prefill
            generations, next_batch, timings = self.model.generate_token(batch)
        File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
            return func(*args, **kwds)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 960, in generate_token
            raise e
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 957, in generate_token
            out, speculative_logits = self.forward(batch)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 900, in forward
            return self.model.forward(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in forward
            hidden_states = self.model(
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 340, in forward
            hidden_states, residual = layer(
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 279, in forward
            mlp_output = self.mlp(normed_attn_res_output)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 226, in forward
            return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
        torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
            return get_command(self)(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
            return self.main(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
            return _main(
        File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
            rv = self.invoke(ctx)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
            return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
            return ctx.invoke(self.callback, **ctx.params)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
            return __callback(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
            return callback(**use_params)  # type: ignore
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
            server.serve(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
            asyncio.run(
        File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
            return loop.run_until_complete(main)
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
            self.run_forever()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
            self._run_once()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
            handle._run()
        File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
            self._context.run(self._callback, *self._args)
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
        File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
            return await self.intercept(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
            exit(1)
        File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
            raise SystemExit(code)
        SystemExit: 1'"""
    },
    "target": "text_generation_launcher",
    "span": {"rank": 0, "name": "shard-manager"},
    "spans": [{"rank": 0, "name": "shard-manager"}],
}
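
The traceback shows why one OOM takes everything down: the torch.cuda.OutOfMemoryError raised inside flash_causal_lm.generate_token propagates to text_generation_server/interceptor.py, whose exception handler calls exit(1); the resulting SystemExit kills the shard process, and the launcher shuts down with it instead of failing only the offending request. A minimal sketch of a more forgiving handler, built on the same grpc_interceptor AsyncServerInterceptor API that appears in the traceback (the class name and recovery behaviour are illustrative, not the actual TGI code):

    # Hypothetical sketch (not the actual TGI interceptor): catch a CUDA OOM
    # raised by a single request, free the cache, and fail only that RPC
    # instead of calling exit(1) and taking the whole shard down.
    import grpc
    import torch
    from grpc_interceptor.server import AsyncServerInterceptor

    class GracefulOOMInterceptor(AsyncServerInterceptor):  # illustrative name
        async def intercept(self, method, request_or_iterator, context, method_name):
            try:
                response = method(request_or_iterator, context)
                return await response
            except torch.cuda.OutOfMemoryError:
                # Release cached blocks so later, smaller batches can still run,
                # then report the failure to the client instead of exiting.
                torch.cuda.empty_cache()
                await context.abort(
                    grpc.StatusCode.RESOURCE_EXHAUSTED, "CUDA out of memory"
                )

Whether emptying the CUDA cache is enough to leave the shard in a usable state after an OOM is an open question; the point is only that the exit(1) is what turns one failed request into a full crash.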

@pranavthombare
Author

I don't think it's a model-specific issue. I still need to reproduce it with other models, but this never used to happen before TGI 2.0.

@pranavthombare
Author

I am able to reproduce it with Mistral and Llama models.
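
For reference, a quick way to script that check (a sketch; the Mistral and Llama model IDs below are illustrative placeholders, not the exact repos used, and each run assumes TGI is already serving that model):

    # Sketch: re-run the same benchmark command across model families via
    # subprocess. Model IDs are illustrative, not the exact repos tested.
    import subprocess

    MODELS = [
        "pranavthombare/Phi-3-mini-4k-construct",  # from the original report
        "mistralai/Mistral-7B-Instruct-v0.2",      # illustrative Mistral repo
        "meta-llama/Llama-2-7b-chat-hf",           # illustrative Llama repo
    ]

    for model in MODELS:
        # Run until the first OOM, then run once more to observe the hard crash.
        for attempt in (1, 2):
            print(f"--- {model}, attempt {attempt} ---")
            subprocess.run(
                ["text-generation-benchmark", "-t", model, "-s", "512"],
                check=False,  # the second run is expected to fail
            )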


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jul 13, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 18, 2024