[Speedster] With Hugging Face notebook code on nebulydocker/nebullvm container: RuntimeError: Expected all tensors to be on the same device #349

Open
trent-s opened this issue Jun 21, 2023 · 5 comments

Comments


trent-s commented Jun 21, 2023

Hi! Thank you for your continued work on this project! I would like to report a possible TensorFlow GPU configuration issue with the documented nebulydocker/nebullvm container that appears to prevent the notebook code from running.

I am trying to use code in the Hugging Face notebook found at
https://github.com/nebuly-ai/nebuly/blob/main/optimization/speedster/notebooks/huggingface/Accelerate_Hugging_Face_PyTorch_BERT_with_Speedster.ipynb

and I am running it in the current nebulydocker/nebullvm Docker container documented at
https://docs.nebuly.com/Speedster/installation/#optional-download-docker-images-with-frameworks-and-optimizers

Here is the exact Python code I am trying to run (essentially the notebook code with a couple of diagnostic lines added):

#!/usr/bin/python
import os
import torch
from transformers import BertTokenizer, BertModel
import random
from speedster import optimize_model

tensorrt_path = "/usr/local/lib/python3.8/dist-packages/tensorrt"

if os.path.exists(tensorrt_path):
    os.environ['LD_LIBRARY_PATH'] += f":{tensorrt_path}"
else:
    print("Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', torchscript=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100

texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))

encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]

dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch'},
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["onnx_tensor_rt","onnx_tvm","onnxruntime","tensor_rt", "tvm"],
    device=str(device),
    dynamic_info=dynamic_info,
)

print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optimized_model.device))

encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

print (final_out)

Just in case it is useful, starting up the container looks like this:

$ docker run -ti --rm -v ~/data:/data -v ~/src:/src --gpus=all nebulydocker/nebullvm:latest

=====================
== NVIDIA TensorRT ==
=====================

NVIDIA Release 23.03 (build 54538654)
NVIDIA TensorRT Version 8.5.3
Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

https://developer.nvidia.com/tensorrt

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh

To install the open-source samples corresponding to this TensorRT release version
run /opt/tensorrt/install_opensource.sh.  To build the open source parsers,
plugins, and samples for current top-of-tree on master or a different branch,
run /opt/tensorrt/install_opensource.sh -b <branch>
See https://github.com/NVIDIA/TensorRT for more information.

And this is the output I get when running the above code:

2023-06-21 07:44:32.387780: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-21 07:44:32.437353: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-21 07:44:34.329062: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-06-21 07:44:42 | INFO     | Running Speedster on GPU:0
2023-06-21 07:44:46 | INFO     | Benchmark performance of original model
2023-06-21 07:44:47 | INFO     | Original model latency: 0.011019186973571777 sec/iter
============= Diagnostic Run torch.onnx.export version 2.0.0+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

2023-06-21 07:44:53 | INFO     | [1/2] Running PyTorch Optimization Pipeline
2023-06-21 07:44:53 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-06-21 07:44:54 | WARNING  | Unable to trace model with torch.fx
2023-06-21 07:46:04 | INFO     | Optimized model latency: 0.007783412933349609 sec/iter
2023-06-21 07:46:04 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-06-21 07:46:04 | WARNING  | Unable to trace model with torch.fx
2023-06-21 07:47:44 | INFO     | Optimized model latency: 0.007919073104858398 sec/iter
2023-06-21 07:47:44 | INFO     | [2/2] Running ONNX Optimization Pipeline

[Speedster results on Tesla V100-PCIE-32GB]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ TorchScript       ┃               ┃
┃ latency     ┃ 0.0110 sec/batch ┃ 0.0078 sec/batch  ┃ 1.42x         ┃
┃ throughput  ┃ 90.75 data/sec   ┃ 128.48 data/sec   ┃ 1.42x         ┃
┃ model size  ┃ 438.03 MB        ┃ 438.35 MB         ┃ 0%            ┃
┃ metric drop ┃                  ┃ 0                 ┃               ┃
┃ techniques  ┃                  ┃ fp32              ┃               ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┛

Max speed-up with your input parameters is 1.42x. If you want to get a faster optimized model, see the following link for some suggestions: https://docs.nebuly.com/Speedster/advanced_options/#acceleration-suggestions

Type of optimized model: <class 'nebullvm.operations.inference_learners.huggingface.HuggingFaceInferenceLearner'> on device: None
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /src/./sample.py:68 in <module>                                                                  │
│                                                                                                  │
│   65 # Warmup for 30 iterations                                                                  │
│   66 for encoded_input in encoded_inputs[:30]:                                                   │
│   67 │   with torch.no_grad():                                                                   │
│ ❱ 68 │   │   final_out = model(**encoded_input)                                                  │
│   69                                                                                             │
│   70 print (final_out)                                                                           │
│   71                                                                                             │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl             │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py:1013 in forward │
│                                                                                                  │
│   1010 │   │   # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x s  │
│   1011 │   │   head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)          │
│   1012 │   │                                                                                     │
│ ❱ 1013 │   │   embedding_output = self.embeddings(                                               │
│   1014 │   │   │   input_ids=input_ids,                                                          │
│   1015 │   │   │   position_ids=position_ids,                                                    │
│   1016 │   │   │   token_type_ids=token_type_ids,                                                │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl             │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py:230 in forward  │
│                                                                                                  │
│    227 │   │   │   │   token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.  │
│    228 │   │                                                                                     │
│    229 │   │   if inputs_embeds is None:                                                         │
│ ❱  230 │   │   │   inputs_embeds = self.word_embeddings(input_ids)                               │
│    231 │   │   token_type_embeddings = self.token_type_embeddings(token_type_ids)                │
│    232 │   │                                                                                     │
│    233 │   │   embeddings = inputs_embeds + token_type_embeddings                                │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl             │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py:162 in forward                 │
│                                                                                                  │
│   159 │   │   │   │   self.weight[self.padding_idx].fill_(0)                                     │
│   160 │                                                                                          │
│   161 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 162 │   │   return F.embedding(                                                                │
│   163 │   │   │   input, self.weight, self.padding_idx, self.max_norm,                           │
│   164 │   │   │   self.norm_type, self.scale_grad_by_freq, self.sparse)                          │
│   165                                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:2210 in embedding                  │
│                                                                                                  │
│   2207 │   │   #   torch.embedding_renorm_                                                       │
│   2208 │   │   # remove once script supports set_grad_enabled                                    │
│   2209 │   │   _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)                    │
│ ❱ 2210 │   return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)        │
│   2211                                                                                           │
│   2212                                                                                           │
│   2213 def embedding_bag(                                                                        │

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

Attempting to call the model appears to cause the final RuntimeError:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

This seems like it may be related to optimized_model.device being None.
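
For reference, here is a quick sanity check (plain PyTorch, nothing Speedster-specific) that can be added right after optimize_model() to see where the weights actually end up:

# Diagnostic sketch: report which device the original model's parameters are on
# after optimization, alongside the device the inference learner reports.
print("original model parameters on:", next(model.parameters()).device)
print("optimized_model.device:", optimized_model.device)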

Just FYI, the GPU seems to be accessible in this container:

# nvidia-smi
Wed Jun 21 09:05:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB            Off| 00000000:AF:00.0 Off |                    0 |
| N/A   33C    P0               23W / 250W|      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-32GB            Off| 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P0               24W / 250W|      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
# python -c "import torch; print(torch.cuda.is_available())"
True

Thank you for looking at this.

SuperSecureHuman (Contributor) commented:

Same issue

Looking further, the model seems to end up on the CPU after going through Speedster (?)

Changing the model to cuda manually before inference works, but the model detaching from the GPU is not the expected behaviour.

[screenshots attached]

I am not sure if this has something to do with the model being detached.

[screenshot attached]
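
Roughly, the workaround looks like this (a minimal sketch, assuming it is the original model object that Speedster leaves on the CPU):

# Workaround sketch: explicitly move the original model back onto the GPU
# before running inference. Standard PyTorch only; assumes `model` was moved
# to the CPU as a side effect of optimize_model().
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

with torch.no_grad():
    out = model(**encoded_inputs[0].to(device))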


trent-s commented Jul 12, 2023

Thank you very much for taking a look at this. That is a good point. The "cannot dlopen some GPU libraries" message sounds serious.

I have a question about the workaround you suggested. I tried calling optimized_model.to(device) to force the model onto the GPU, but as the following output shows, there is no .to() method.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/./hGPT2gpu.py:67 in <module>                                                               │
│                                                                                                  │
│    64                                                                                            │
│    65 print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optim   │
│    66 print("moving model to gpu")                                                               │
│ ❱  67 optimized_model.to(device)                                                                 │
│    68 print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optim   │
│    69                                                                                            │
│    70 # print (dir(optimized_model))                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'HuggingFaceInferenceLearner' object has no attribute 'to'

Is there another way to move the model to cuda? Thanks!

SuperSecureHuman (Contributor) commented:

It's an InferenceLearner object.

I am not exactly sure how to move it, but a higher-level approach would be to get the model out of the inference learner and move it to the GPU.
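
Something along these lines might work as a starting point (purely a sketch; the attribute name below is a guess, not a documented nebullvm API, so inspect the object first):

# Sketch only: look for the torch module wrapped inside the inference learner
# and move it to the GPU. "model" is a hypothetical attribute name; use
# dir(optimized_model) to find the real one in your nebullvm version.
print(dir(optimized_model))

inner = getattr(optimized_model, "model", None)  # hypothetical attribute
if isinstance(inner, torch.nn.Module):
    inner.to("cuda")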


trent-s commented Jul 12, 2023

Thanks! That sounds like a good suggestion. I will try that!


lei-rs commented Jul 27, 2023

This seems related to pytorch/pytorch#72175; the solution is to first export to ONNX on the CPU, then optimize it on the GPU.
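
As a rough sketch of that idea (not how Speedster wires this up internally): do the ONNX export with the model and example inputs on the CPU, and only move to the GPU for the optimization/inference step afterwards, e.g.:

# Sketch of the pattern from pytorch/pytorch#72175: export on CPU first.
model_cpu = model.to("cpu").eval()
example = tokenizer("Mars is the fourth planet from the Sun.", return_tensors="pt")

torch.onnx.export(
    model_cpu,
    (example["input_ids"], example["attention_mask"], example["token_type_ids"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "num_tokens"},
        "attention_mask": {0: "batch", 1: "num_tokens"},
        "token_type_ids": {0: "batch", 1: "num_tokens"},
    },
)
# The resulting bert.onnx can then be handed to a GPU backend (e.g. TensorRT
# or ONNX Runtime with a CUDA execution provider) for optimization.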
