MPI TRUNCATED when run imagenet dataset on resnet18 #1
Comments
@GeKeShi sorry for the late response.

i) Regarding the first issue you reported: this error usually occurs when the local receive buffers (on the PS or worker nodes) are smaller than the messages sent from the source nodes. That should depend only on the model you're using, which does not seem to be the case on your end, since it works for CIFAR-10/100 but not for ImageNet. Can you share more details on this issue, e.g. by pointing me to your fork? Also, the following change might help, i.e. changing the

ii) The decoding function is written in NumPy, so I currently have no idea why the PyTorch version would affect your speed. Can you also share your NumPy and Python versions? Also, did you try to locate which part is the performance bottleneck?

Hope this helps.
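To illustrate the buffer-size condition in (i), here is a minimal sketch of a check that could be run before the broadcast. It is not code from this repository: `find_truncated_layers` and the layer shapes are hypothetical, and the 8-byte item size assumes the buffers really hold `MPI.DOUBLE` (float64) values. The shapes model one possible cause, receive buffers sized for a 10-class head while the root sends a 1000-class head, which may or may not be what is happening here.

```python
from math import prod


def find_truncated_layers(sent_shapes, recv_shapes, itemsize=8):
    """Return (layer_idx, sent_bytes, recv_bytes) for every layer whose
    receive buffer is smaller than the message the root would send.

    itemsize=8 assumes float64 elements, matching MPI.DOUBLE.
    """
    problems = []
    for idx, (sent, recv) in enumerate(zip(sent_shapes, recv_shapes)):
        sent_bytes = prod(sent) * itemsize
        recv_bytes = prod(recv) * itemsize
        if recv_bytes < sent_bytes:
            problems.append((idx, sent_bytes, recv_bytes))
    return problems


# Hypothetical shapes: the root holds a 1000-class classifier,
# while the worker allocated its buffers for a 10-class one.
root_shapes = [(64, 3, 7, 7), (1000, 512), (1000,)]
worker_shapes = [(64, 3, 7, 7), (10, 512), (10,)]

mismatches = find_truncated_layers(root_shapes, worker_shapes)
for idx, sent_bytes, recv_bytes in mismatches:
    print(f"layer {idx}: root sends {sent_bytes} bytes, buffer holds {recv_bytes}")
```

Logging the same two byte counts on the root and on one worker for each `layer_idx` just before the `Bcast` call would show whether any layer differs between them.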
Thanks for your reply. Here are some details:
Meanwhile, the output from PyTorch 0.3.0 is:
Hello, I'm trying to test this code on ImageNet, but when the program reaches
self.comm.Bcast([self.model_recv_buf.recv_buf[layer_idx], MPI.DOUBLE], root=0)
in the function `async_fetch_weights_bcast` in distributed_worker.py at the first step, it throws the error `MPI_ERR_TRUNCATE: message truncated`. I checked the memory size in Bcast, and the same code works when run on CIFAR-10/100. Have you encountered this problem?

Another issue: after I replaced PyTorch 0.3.0 with PyTorch 0.4/1.1, the time spent in the QSGD decode step became significantly higher, almost 10 times that of 0.3.0. Have you tried this?
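One way to narrow down the decode slowdown is to time each stage separately under both PyTorch versions and compare. The sketch below uses only the standard library. `StageTimer` and the stage names `unpack` and `dequantize` are hypothetical placeholders, and the list comprehensions stand in for the real decode steps, which are not shown in this thread.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Slowest stage first.
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


timer = StageTimer()
for _ in range(3):
    with timer.stage("unpack"):  # placeholder for the real unpacking step
        data = [i % 7 for i in range(10000)]
    with timer.stage("dequantize"):  # placeholder for the real rescaling step
        scaled = [x * 0.5 for x in data]

for name, total in timer.report():
    print(f"{name}: {total:.6f}s over {timer.counts[name]} calls")
```

Wrapping the tensor-to-NumPy conversion in its own stage, separate from the NumPy decode itself, would show whether the extra time is spent inside the decode or in moving data out of PyTorch.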