Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error deploying docker on P3 instance #424

Closed
2 of 4 tasks
FionnD opened this issue Jun 7, 2023 · 4 comments
Closed
2 of 4 tasks

Error deploying docker on P3 instance #424

FionnD opened this issue Jun 7, 2023 · 4 comments

Comments

@FionnD
Copy link

FionnD commented Jun 7, 2023

System Info

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id tiiuae/falcon-7b --num-shard 2
ubuntu@ip-172-31-35-173:~$ nvidia-smi
Wed Jun  7 09:30:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I'm trying to deploy falcon-7b using the following command on an AWS p3.2xlarge

I begin by running the following docker run command

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id tiiuae/falcon-7b --num-shard 2

But see a failure here. Which seems to be a mixture of CUDA and Flash errors. The output to Nvidia-smi is below the error stack

023-06-07T09:27:35.192913Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-07T09:27:35.193171Z  INFO text_generation_launcher: Starting shard 0
2023-06-07T09:27:35.193291Z  INFO text_generation_launcher: Starting shard 1
2023-06-07T09:27:38.785701Z  WARN shard-manager: text_generation_launcher: Could not import Flash Attention enabled models
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 58, in serve
    from text_generation_server import server
  File "<frozen importlib._bootstrap>", line 1058, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 12, in <module>
    from text_generation_server.cache import Cache
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cache.py", line 3, in <module>
    from text_generation_server.models.types import Batch
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 29, in <module>
    raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported
 rank=0
2023-06-07T09:27:38.785987Z  WARN shard-manager: text_generation_launcher: Could not import Flash Attention enabled models
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 58, in serve
    from text_generation_server import server
  File "<frozen importlib._bootstrap>", line 1058, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 12, in <module>
    from text_generation_server.cache import Cache
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cache.py", line 3, in <module>
    from text_generation_server.models.types import Batch
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 29, in <module>
    raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported
 rank=1
2023-06-07T09:27:39.417101Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(
NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed.
Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention with `cd server && make install install-flash-attention`
 rank=0
2023-06-07T09:27:39.424444Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(
NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed.
Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention with `cd server && make install install-flash-attention`
 rank=1
2023-06-07T09:27:39.997276Z ERROR text_generation_launcher: Shard 0 failed to start:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(

NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed.
Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention with `cd server && make install install-flash-attention`


2023-06-07T09:27:39.997336Z  INFO text_generation_launcher: Shutting down shards
ubuntu@ip-172-31-35-173:~$ nvidia-smi
Wed Jun  7 09:30:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected behavior

I would have expected this docker image to work on a P3 instance

@OlivierDehaene
Copy link
Member

Volta GPUs are not supported for recent models.
Any reason you specifically want to deploy on p3? g5 is cheaper and supported.

@jshin49
Copy link

jshin49 commented Jun 7, 2023

I think if you use Llama it might work. I was able to make it work on V100 gpus

@FionnD
Copy link
Author

FionnD commented Jun 8, 2023

Thanks for the prompt reply, it runs fine on a G5 instance.

@FionnD FionnD closed this as completed Jun 8, 2023
@empty2enrich
Copy link

I think if you use Llama it might work. I was able to make it work on V100 gpus

I use v100, also encountered problems, log:

...
2023-09-06T03:45:33.105882Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-09-06T03:45:33.106285Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0

**2023-09-06T03:45:37.670306Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: GPU with CUDA capability 7 0 is not supported**

2023-09-06T03:45:43.115395Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants