MPI TRUNCATED when run imagenet dataset on resnet18 #1

Open
GeKeShi opened this issue May 24, 2019 · 2 comments

GeKeShi commented May 24, 2019

Hello, I'm trying to test this code on ImageNet, but when the program reaches self.comm.Bcast([self.model_recv_buf.recv_buf[layer_idx], MPI.DOUBLE], root=0) in the async_fetch_weights_bcast function in distributed_worker.py at the first step, it throws an MPI_ERR_TRUNCATE: message truncated error. I checked the buffer sizes passed to Bcast, and the same code works when run on CIFAR-10/100. Have you encountered this problem?
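
For reference, here is a minimal mpi4py sketch of this failure mode (illustrative only, not code from this repository): Bcast raises MPI_ERR_TRUNCATE when the receive buffer posted by a non-root rank is smaller than the message broadcast by the root, which can happen if the worker and the PS build the model with different layer shapes.

```python
# Minimal sketch of the MPI_ERR_TRUNCATE condition (illustrative, not repo code).
# Run with e.g.: mpirun -n 2 python bcast_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

layer_numel = 1000  # number of float64 elements the root broadcasts for one layer

if rank == 0:
    send_buf = np.random.rand(layer_numel)  # stand-in for one layer's weights
    comm.Bcast([send_buf, MPI.DOUBLE], root=0)
else:
    # Correct: the receive buffer matches what rank 0 sends.
    recv_buf = np.empty(layer_numel, dtype=np.float64)
    # If recv_buf were allocated smaller than layer_numel (e.g. the worker's model
    # has a different shape for this layer than the PS's model), this call would
    # fail with MPI_ERR_TRUNCATE: message truncated.
    comm.Bcast([recv_buf, MPI.DOUBLE], root=0)
```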

Another issue: after I replaced PyTorch 0.3.0 with PyTorch 0.4/1.1, the QSGD decoding time became significantly higher, almost 10 times that of 0.3.0. Have you tried this?

hwang595 (Owner) commented Jun 2, 2019

@GeKeShi Sorry for the late response.

i) On the first issue you reported: the error usually occurs when the local receive buffer (on the PS or worker nodes) is smaller than the message sent from the source node. That should depend only on the model you're using (though it seems not to be the case on your end, i.e. it works for CIFAR-10/100 but not for ImageNet). Can you share more details on this issue, e.g. point me to your fork? Also, the following change might help: switching the async_fetch_weights_bcast function to this version, https://github.com/hwang595/ps_pytorch/blob/master/src/distributed_worker.py#L221-L231, i.e. compressing the model with a lossless compression tool so that each node can maintain a smaller receive buffer locally. Please note that you also need the corresponding change on the PS end: https://github.com/hwang595/ps_pytorch/blob/master/src/sync_replicas_master_nn.py#L218-L225.
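
As a rough sketch of the compressed-broadcast idea above (assuming pickle + zlib purely for illustration; the linked lines may use a different compression library), the PS serializes and losslessly compresses the parameters and broadcasts the resulting byte blob with the object-based bcast, so workers no longer need a pre-sized NumPy receive buffer:

```python
# Illustrative sketch only: broadcast compressed, pickled weights instead of
# raw fixed-size buffers. Not the exact code in the linked lines.
from mpi4py import MPI
import numpy as np
import pickle, zlib

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Stand-in for the model parameters held on the PS.
    weights = [np.random.rand(64, 3, 7, 7), np.random.rand(64)]
    blob = zlib.compress(pickle.dumps(weights))
else:
    blob = None

# The lower-case, object-based bcast pickles the payload and sizes buffers
# internally, so no per-layer receive buffer has to be pre-allocated.
blob = comm.bcast(blob, root=0)
weights = pickle.loads(zlib.decompress(blob))
```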

ii) The decoding function is written in NumPy, so I currently don't have a clue why the PyTorch version would influence your speed. Can you share the NumPy and Python versions on your end? Also, did you try to locate which part is the performance bottleneck?
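
One way to locate the bottleneck (a suggestion only, not code from this repo) is to wrap the suspected decode call in cProfile and compare the hottest calls under PyTorch 0.3.0 and 1.1:

```python
# Hypothetical helper for profiling the decode step; `decode_fn` stands for
# whatever decoding routine is suspected to be slow.
import cProfile, pstats, io

def profile_decode(decode_fn, *args, **kwargs):
    """Run decode_fn under cProfile and print the 10 most expensive calls."""
    pr = cProfile.Profile()
    pr.enable()
    result = decode_fn(*args, **kwargs)
    pr.disable()
    s = io.StringIO()
    pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(10)
    print(s.getvalue())
    return result
```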

Hope these are helpful.

GeKeShi (Author) commented Jun 10, 2019

Thanks for your reply; here are some details.
i) I replaced the model implementation with the one from torchvision, and the size mismatch was solved.
ii) NumPy is 1.12.1 and PyTorch is 1.1.0. The program output when training CIFAR-10 on ResNet-18 is as follows:

Worker: 2, Step: 98, Epoch: 0 [3104/50000 (6%)], Loss: 1.8777, Time Cost: 5.1246, Comp: 0.0225, Encode:  4.9213, Comm:  0.0906, Msg(MB):  25.5414, Prec@1:  25.0000, Prec@5:  87.5000
Worker: 1, Step: 98, Epoch: 0 [3104/50000 (6%)], Loss: 2.0034, Time Cost: 5.2109, Comp: 0.0224, Encode:  4.9971, Comm:  0.1011, Msg(MB):  25.5444, Prec@1:  18.7500, Prec@5:  81.2500
Master: Step: 98, Decode Cost: 130.71253109, Cur lr 0.0095, Gather: 5.15524792671

Meanwhile, the output from PyTorch 0.3.0 is:

Worker: 1, Step: 432, Epoch: 1 [5120/50000 (10%)], Loss: 1.0311, Time Cost: 6.9317, Comp: 0.6725, Encode:  5.8544, Comm:  0.1337, Msg(MB):  30.4401, Prec@1:  64.0625, Prec@5:  96.0938
Worker: 2, Step: 432, Epoch: 1 [5120/50000 (10%)], Loss: 1.2026, Time Cost: 7.0075, Comp: 0.8195, Encode:  5.7536, Comm:  0.1939, Msg(MB):  30.4277, Prec@1:  53.9062, Prec@5:  93.7500
Master: Step: 432, Decode Cost: 14.2414638996, Cur lr 0.00663420431289, Gather: 6.93086600304
