RuntimeError: 0 <= device.index() && device.index() < static_cast<c10::DeviceIndex>(device_ready_queues_.size()) INTERNAL ASSERT FAILED at "/build/pytorch/torch/csrc/autograd/engine.cpp":1418 #571

Open
SoldierWz opened this issue Mar 26, 2024 · 2 comments
Labels
ARC ARC GPU

Comments

@SoldierWz

Describe the bug

When this problem first occurred, I tried disabling the CPU core. After that the code ran, but the results were very poor: accuracy dropped sharply and training took longer. I reported that behavior in issue #565. When I restored the CPU core, the error in the title occurred.
Here is the part of the code where the problem occurs. The error is raised at loss.backward().
# Imports presumably used by the notebook; kf, X_tensor, y_tensor,
# CustomDataset, and MLP are defined elsewhere and not shown here.
import torch
import intel_extension_for_pytorch as ipex
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

device = 'xpu'
for train_idx, test_idx in kf.split(X_tensor):
    X_train, X_test = X_tensor[train_idx], X_tensor[test_idx]
    y_train, y_test = y_tensor[train_idx], y_tensor[test_idx]

    train_dataset = CustomDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    # A fresh model, loss, and optimizer are created for every fold.
    model = MLP(X_train.shape[1])
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    model = model.to(device)
    criterion = criterion.to(device)
    model, optimizer = ipex.optimize(model, optimizer=optimizer)
    for epoch in range(1000):
        model.train()
        for features, labels in train_loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(features)
            loss = criterion(outputs, labels)
            loss.backward()  # the RuntimeError is raised here
            optimizer.step()

Versions

wget https://raw.githubusercontent.com/intel/intel-extension-for-pytorch/master/scripts/collect_env.py

For security purposes, please check the contents of collect_env.py before running it.

python collect_env.py

@jgong5
Contributor

jgong5 commented Mar 26, 2024

May I know what you mean by "disable CPU core"? It sounds like no GPU was found according to the error message. But we should report more meaningful error messages. cc @gujinghui
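
One way to test the "no GPU was found" hypothesis is to check for an XPU device before the training loop starts. The sketch below is only an illustration: `pick_device` is a hypothetical helper, and it assumes that `torch.xpu.is_available()` is exposed once `intel_extension_for_pytorch` has been imported (or by a PyTorch build with native XPU support). It falls back to the CPU whenever something is missing.

```python
def pick_device():
    """Return "xpu" if an XPU device is visible, otherwise "cpu".

    Hypothetical helper for illustration; not part of PyTorch or IPEX.
    """
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or one of its shared libraries failed to load.
        return "cpu"
    try:
        # Assumption: importing IPEX registers the torch.xpu backend.
        import intel_extension_for_pytorch  # noqa: F401
    except Exception:
        # IPEX is absent or failed to load; torch.xpu may still exist natively.
        pass
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


device = pick_device()
print(f"training on: {device}")
```

If this prints `cpu` on a machine with an ARC card, the driver or runtime is probably not visible to PyTorch, and moving tensors to `"xpu"` would be expected to fail.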

@SoldierWz
Author

> May I know what you mean by "disable CPU core"? It sounds like no GPU was found according to the error message. But we should report more meaningful error messages. cc @gujinghui

I edited the GRUB configuration file and changed
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
to
GRUB_CMDLINE_LINUX_DEFAULT="nohz=off"
There is another line, GRUB_CMDLINE_LINUX="i915.enable_hangcheck=0", which I did not change.
After this edit, the GPU can be used.
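
Edits to the GRUB defaults file typically take effect only after the GRUB configuration is regenerated and the machine is rebooted, so it can help to confirm what the running kernel actually received. The following is a minimal sketch, assuming a Linux system that exposes `/proc/cmdline`; `parse_cmdline` and `running_cmdline` are hypothetical helper names, and quoted values containing spaces are not handled.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without "=" (for example "quiet") map to None.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def running_cmdline(path="/proc/cmdline"):
    """Parameters of the running kernel, or {} where the file is absent."""
    p = Path(path)
    return parse_cmdline(p.read_text()) if p.exists() else {}


params = running_cmdline()
for name in ("nohz", "i915.enable_hangcheck"):
    print(name, "=", params.get(name, "<not set>"))
```

If `nohz` or `i915.enable_hangcheck` shows as `<not set>`, the edited values have not reached the running kernel.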
However, I just tried again and a new problem occurred. The error is reported below.

ImportError Traceback (most recent call last)
Cell In[2], line 8
6 import modin.pandas as pd
7 import numpy as np
----> 8 import torch
9 import intel_extension_for_pytorch as ipex
10 import torch.nn as nn

File ~/mambaforge/envs/pytorch-arc/lib/python3.11/site-packages/torch/__init__.py:235
233 if USE_GLOBAL_DEPS:
234 _load_global_deps()
--> 235 from torch._C import * # noqa: F403
237 # Appease the type checker; ordinarily this binding is inserted by the
238 # torch._C module initialization code in C
239 if TYPE_CHECKING:

ImportError: /home/wangzhen/mambaforge/envs/pytorch-arc/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent
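
An `undefined symbol` error at import time generally means that `libtorch_cpu.so` expects a symbol that none of the libraries loaded alongside it provides, which often points to mismatched package versions in the environment. As a first diagnostic step, the sketch below lists the installed versions of a few packages without importing `torch` itself. The package list is an assumption: `mkl` is included because a mismatched MKL build is one commonly reported cause of this particular symbol error, but the thread does not confirm that. `installed_versions` is a hypothetical helper, and packages installed only through conda may not always appear in this metadata.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package names; adjust to match the environment being checked.
report = installed_versions(["torch", "intel-extension-for-pytorch", "mkl"])
for name, version in report.items():
    print(f"{name}: {version or 'not found in package metadata'}")
```

Attaching this output to the issue, together with the `collect_env.py` report requested in the Versions section, would give maintainers the version information they need to reproduce the failure.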

@ZhaoqiongZ ZhaoqiongZ added the ARC ARC GPU label Apr 24, 2024