
Unable to run benchmark #170

Closed
CrossNox opened this issue Apr 24, 2023 · 5 comments

Comments

@CrossNox

I'm trying to run the txt2img benchmark, but can't get it to work.

I tried several combinations of versions for deepspeed (incl. from source), deepspeed-mii (incl. from source), torch==1.13.1 (also with 1.12.1 when using cuda 11.6), diffusers, transformers and triton.

Environment:

  • OS: Amazon Linux 2 w/kernel 5.10
  • GPU: Nvidia Tesla T4
  • python version: 3.9, compiled from source
  • nvidia driver: 510.108.03-grid
  • cuda version: 11.7

All of the following fail with {created_time:"2023-04-24T15:34:23.563287825+00:00", grpc_status:2, grpc_message:"Exception calling application: Triton Error [CUDA]: invalid argument"}:

  • pip install -U deepspeed==0.7.5 deepspeed-mii==0.0.3 torch==1.13.1 diffusers==0.7.1 transformers==4.24.0 triton==2.0.0.dev20221030 ftfy
  • pip install -U git+https://github.com/microsoft/DeepSpeed.git@c9c6ab9e32b054136c3a125900d6e2ed937432be deepspeed-mii==0.0.4 torch==1.13.1 diffusers==0.11.1 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy
  • pip install -U deepspeed==0.8.0 deepspeed-mii==0.0.4 torch==1.13.1 diffusers==0.11.1 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy
  • pip install -U deepspeed==0.7.7 deepspeed-mii==0.0.4 torch==1.13.1 diffusers==0.10.2 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy

pip install -U deepspeed==0.9.1 deepspeed-mii==0.0.3 torch==1.13.1 diffusers==0.14.0 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy fails with ImportError: cannot import name 'CLIPTextModelWithProjection' from 'transformers' (/home/ec2-user/.local/lib/python3.9/site-packages/transformers/__init__.py)

pip install -U deepspeed==0.9.1 deepspeed-mii==0.0.3 torch==1.13.1 diffusers==0.14.0 transformers==4.26.0 triton==2.0.0.dev20221202 ftfy fails with AttributeError: 'StableDiffusionPipeline' object has no attribute 'children'

pip install -U deepspeed==0.7.5 deepspeed-mii==0.0.3 torch==1.13.1 diffusers==0.11.1 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy fails with {created_time:"2023-04-24T16:19:36.666852314+00:00", grpc_status:2, grpc_message:"Exception calling application: \'DSUNet\' object has no attribute \'config\'"}

pip install -U git+https://github.com/microsoft/DeepSpeed.git@35eabb0a336e7a8e9950a550475ceaebda42066c deepspeed-mii==0.0.4 torch==1.13.1 diffusers==0.11.1 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy fails with {grpc_message:"Exception calling application: forward() got an unexpected keyword argument \'encoder_hidden_states\'", grpc_status:2, created_time:"2023-04-24T16:34:55.067998104+00:00"}. Same for pip install -U deepspeed==0.7.7 deepspeed-mii==0.0.4 torch==1.13.1 diffusers==0.11.1 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy

This issue seems relevant here.

pip install -U deepspeed==0.7.6 deepspeed-mii==0.0.4 torch==1.13.1 diffusers==0.11.1 transformers==4.24.0 triton==2.0.0.dev20221202 ftfy fails with {created_time:"2023-04-24T17:29:38.723537173+00:00", grpc_status:2, grpc_message:"Exception calling application: \'DSUNet\' object has no attribute \'config\'"}

This other issue seems relevant here.

Is there any set of versions that is well known to work?

@mrwyattii
Contributor

mrwyattii commented Apr 25, 2023

@CrossNox thank you for reporting this issue. It looks like a recent change in DeepSpeed was not accounted for in MII. Please try this PR: #172

I just tested with the following:

deepspeed          0.9.2
diffusers          0.14.0
torch              1.13.1
transformers       4.28.1

@CrossNox
Author

CrossNox commented Apr 26, 2023

Hi @mrwyattii, thanks for the reply.
deepspeed 0.9.2 is not available on PyPI, nor is there a tagged version on GitHub. Did you install from source? At which commit?

Also, which triton version did you get it to work with?

For completeness:

packages installed

Details


$ pip show deepspeed diffusers torch transformers triton deepspeed-mii
Name: deepspeed
Version: 0.9.2+0e357666
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: MIT
Location: /home/ec2-user/venv39/lib/python3.9/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, torch, tqdm
Required-by: deepspeed-mii
---
Name: diffusers
Version: 0.14.0
Summary: Diffusers
Home-page: https://github.com/huggingface/diffusers
Author: The HuggingFace team
Author-email: patrick@huggingface.co
License: Apache
Location: /home/ec2-user/venv39/lib/python3.9/site-packages
Requires: filelock, huggingface-hub, importlib-metadata, numpy, Pillow, regex, requests
Required-by: 
---
Name: torch
Version: 1.13.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/ec2-user/venv39/lib/python3.9/site-packages
Requires: nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, typing-extensions
Required-by: deepspeed, deepspeed-mii, triton
---
Name: transformers
Version: 4.28.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /home/ec2-user/venv39/lib/python3.9/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, tokenizers, tqdm
Required-by: deepspeed-mii
---
Name: triton
Version: 2.0.0.dev20221202
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License: 
Location: /home/ec2-user/venv39/lib/python3.9/site-packages
Requires: cmake, filelock, torch
Required-by: 
---
Name: deepspeed-mii
Version: 0.0.5+835a2a9
Summary: deepspeed mii
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-mii@microsoft.com
License: UNKNOWN
Location: /home/ec2-user/venv39/lib/python3.9/site-packages
Requires: asyncio, deepspeed, Flask-RESTful, grpcio, grpcio-tools, pydantic, torch, transformers, Werkzeug
Required-by:

output of ds_report

Details


$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/venv39/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/ec2-user/venv39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.2+0e357666, 0e357666, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

python

$ python --version
Python 3.9.16

error

Details


Traceback (most recent call last):
  File "/home/ec2-user/DeepSpeed-MII/examples/benchmark/txt2img/mii-sd.py", line 27, in <module>
    results = pipe.query(prompts)
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/mii/client.py", line 125, in query
    response = self.asyncio_loop.run_until_complete(
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/mii/client.py", line 109, in _query_in_tensor_parallel
    await responses[0]
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/mii/client.py", line 72, in _request_async_response
    proto_response = await getattr(self.stub, conversions["method"])(proto_request)
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: Triton Error [CUDA]: invalid argument"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2023-04-26T14:01:16.677843111+00:00", grpc_status:2, grpc_message:"Exception calling application: Triton Error [CUDA]: invalid argument"}"
>

Error with triton==2.0.0.post1

Details

Traceback (most recent call last):
  File "/home/ec2-user/DeepSpeed-MII/examples/benchmark/txt2img/mii-sd.py", line 27, in <module>
    results = pipe.query(prompts)
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/mii/client.py", line 125, in query
    response = self.asyncio_loop.run_until_complete(
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/mii/client.py", line 109, in _query_in_tensor_parallel
    await responses[0]
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/mii/client.py", line 72, in _request_async_response
    proto_response = await getattr(self.stub, conversions["method"])(proto_request)
  File "/home/ec2-user/venv39/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: at 58:24:
def _fwd_kernel(
    Q,
    K,
    V,
    sm_scale,
    TMP,
    Out,
    stride_qz,
    stride_qh,
    stride_qm,
    stride_qk,
    stride_kz,
    stride_kh,
    stride_kn,
    stride_kk,
    stride_vz,
    stride_vh,
    stride_vk,
    stride_vn,
    stride_oz,
    stride_oh,
    stride_om,
    stride_on,
    Z,
    H,
    N_CTX,
    BLOCK_M: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
    BLOCK_N: tl.constexpr,
):
    start_m = tl.program_id(0)
    off_hz = tl.program_id(1)
    # initialize offsets
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_DMODEL)
    off_q = off_hz * stride_qh + offs_m[:, None] * stride_qm + offs_d[None, :] * stride_qk
    off_k = off_hz * stride_kh + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kk
    off_v = off_hz * stride_vh + offs_n[:, None] * stride_qm + offs_d[None, :] * stride_qk
    # Initialize pointers to Q, K, V
    q_ptrs = Q + off_q
    k_ptrs = K + off_k
    v_ptrs = V + off_v
    # initialize pointer to m and l
    t_ptrs = TMP + off_hz * N_CTX + offs_m
    m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf")
    l_i = tl.zeros([BLOCK_M], dtype=tl.float32)
    acc = tl.zeros([BLOCK_M, BLOCK_DMODEL], dtype=tl.float32)
    # load q: it will stay in SRAM throughout
    q = tl.load(q_ptrs)
    # loop over k, v and update accumulator
    for start_n in range(0, N_CTX, BLOCK_N):
        start_n = tl.multiple_of(start_n, BLOCK_N)
        # -- compute qk ----
        k = tl.load(k_ptrs + start_n * stride_kn)

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
        qk += tl.dot(q, k, trans_b=True)
                        ^"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Exception calling application: at 58:24: [same kernel source as above], grpc_status:2, created_time:"2023-04-26T14:05:06.751619613+00:00"}"
>
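The `^` in that error points at the trans_b=True argument. My understanding (an assumption, not confirmed in this thread) is that the stable Triton 2.0.0 release removed the trans_a/trans_b keywords from tl.dot, so kernels written for the 2022 dev builds no longer compile; on newer Triton the call would have to transpose explicitly, e.g. tl.dot(q, tl.trans(k)). A minimal pure-Python sketch of the equivalence, with plain lists standing in for Triton ops so it runs without a GPU:

```python
# Sketch: dot(q, k, trans_b=True) == dot(q, transpose(k)).
# Plain-Python stand-ins for tl.dot / tl.trans, only to illustrate the rewrite.

def transpose(m):
    """Swap rows and columns (stand-in for tl.trans)."""
    return [list(col) for col in zip(*m)]

def dot(a, b, trans_b=False):
    """Row-by-column matmul; trans_b mimics the removed keyword argument."""
    if trans_b:
        b = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

q = [[1, 2], [3, 4]]
k = [[5, 6], [7, 8]]

# Old dev-build style vs. explicit-transpose style: identical results.
assert dot(q, k, trans_b=True) == dot(q, transpose(k))
```

If that is the cause, it would explain why the 2.0.0.dev2022* builds behave differently from 2.0.0.post1 here.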

@CrossNox
Author

Hi, I'm closing this. I kinda "solved" it by changing the GPU. I was using a T4, but with an A10G it mostly works. I think the issue is that some kernels are not compatible with architectures older than Ampere.
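For anyone hitting this later: a quick way to check the architecture theory before launching the benchmark is to read the GPU's compute capability (the T4 is sm_75 Turing, the A10G is sm_86 Ampere). A small sketch, assuming torch is installed; the Ampere cutoff of (8, 0) is my guess at what the fused kernels require, not something confirmed in this thread:

```python
# Sketch: warn if the local GPU is pre-Ampere (< sm_80), since some
# fused/flash-attention kernels appear to assume Ampere or newer.

def is_ampere_or_newer(capability):
    """True for compute capability (8, 0) and up, e.g. A10G; False for T4 (7, 5)."""
    return capability >= (8, 0)

def check_gpu():
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    cap = torch.cuda.get_device_capability(0)  # e.g. (7, 5) on a T4
    name = torch.cuda.get_device_name(0)
    if is_ampere_or_newer(cap):
        return f"{name} (sm_{cap[0]}{cap[1]}): should be fine"
    return f"{name} (sm_{cap[0]}{cap[1]}): pre-Ampere, kernels may fail"

print(check_gpu())
```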

@mrwyattii
Contributor

@CrossNox you will need to use an older dev release of triton (we haven't finished updating DeepSpeed to use the latest Triton 2.0.0 release): https://pypi.org/project/triton/2.0.0.dev20221202/

And yes I installed from source. You can just do pip install git+https://github.com/microsoft/deepspeed

@CrossNox
Author

@mrwyattii yes, I tried that as well. But nothing worked until I figured out that the issue was the GPU.
