
How to use l2l.algorithms.MAML correctly with nn.DistributedDataParallel? #170

Closed

AyanamiReiFan opened this issue Aug 13, 2020 · 13 comments

@AyanamiReiFan

This work is awesome!

Using nn.DistributedDataParallel in the following way raises an error when executing learner = maml.clone().
How do I use it correctly? Should I wrap MyModel with nn.DistributedDataParallel first and then apply MAML?
Thanks!

model = MyModel()
maml = l2l.algorithms.MAML(model, lr=0.5)
model = nn.DistributedDataParallel(model, device_ids=[rank])
...
learner = maml.clone()  # raises an error here

@seba-1511
Member

Hello @AyanamiReiFan, and thanks for the kind words.

Parallelizing MAML with DistributedDataParallel is a bit tricky as the implementation relies on gradient hooks which don't play well with clone/grad. If you want to use torch.distributed to parallelize the training loop, cherry's Distributed optimizer is another option:

import torch.optim as optim
from cherry.optim import Distributed

opt = optim.Adam(model.parameters())
opt = Distributed(model.parameters(), opt, sync=1)  # synchronizes the update across processes
# Training code
opt.step()

If you want to parallelize the model over GPUs, I would use torch.nn.DataParallel:

learner = maml.clone()
learner = torch.nn.DataParallel(learner, device_ids=[0, 1])
# Training code

Let me know if you ever find a solution to using DistributedDataParallel, I'd be curious to know the solution.
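
For concreteness, a minimal end-to-end sketch of how the cherry approach might look (not from the docs; the toy nn.Linear model and random batches are placeholders, and it assumes the script is launched with torch.distributed.launch or torchrun so that init_process_group can pick up the process group):

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import learn2learn as l2l
from cherry.optim import Distributed

def main():
    dist.init_process_group(backend='gloo')            # 'nccl' when each process owns a GPU
    model = nn.Linear(10, 2)                           # toy model standing in for MyModel
    maml = l2l.algorithms.MAML(model, lr=0.5)
    opt = optim.Adam(maml.parameters())
    opt = Distributed(maml.parameters(), opt, sync=1)  # synchronized meta-optimizer
    loss_fn = nn.CrossEntropyLoss()

    for iteration in range(10):
        opt.zero_grad()
        learner = maml.clone()
        # Toy support/query batches standing in for a sampled task.
        x_s, y_s = torch.randn(5, 10), torch.randint(0, 2, (5,))
        x_q, y_q = torch.randn(5, 10), torch.randint(0, 2, (5,))
        learner.adapt(loss_fn(learner(x_s), y_s))      # inner-loop adaptation
        loss_fn(learner(x_q), y_q).backward()          # meta (outer-loop) loss
        opt.step()                                     # update synchronized across processes

if __name__ == '__main__':
    main()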

@AyanamiReiFan
Author


Thanks very much!
My main goal is to accelerate training by using multiple GPUs.
I have tried torch.nn.DataParallel to parallelize the model over GPUs, but after learner = torch.nn.DataParallel(learner, device_ids=[0, 1]) I have to call learner.module.adapt to run the adaptation step, and that call is not parallelized. Do you have any suggestions?

@AyanamiReiFan
Author

I'm training a 1-way 5-shot segmentation model with MAML, so the training batch size can only be 5.

So I don't think parallelizing the adapt and evaluation steps within each iteration will speed things up much, i.e. with:

learner = maml.clone()
learner = torch.nn.DataParallel(learner, device_ids=[0, 1])

That is why I tried to parallelize the MAML module itself and have it compute different batches on multiple GPUs, but my attempt with DistributedDataParallel seems to be wrong. Do you have any suggestions?

Thanks very much!

@janbolle
Contributor

Hello @AyanamiReiFan,
I actually wrote a paper about parallelizing MAML.
I implemented it using Ray, with n separate learners (n being the number of GPUs you want to use). After training them separately, I averaged the weights of the learners into a central learner.

Maybe this helps.
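
Roughly, the pattern looks like the sketch below (not the paper's code; the per-learner training loop here is a toy supervised loop standing in for the actual MAML training, and the model is a placeholder):

import copy
import ray
import torch
import torch.nn as nn

@ray.remote
def train_learner(state_dict, steps=100):
    # One independent learner, started from the shared weights (toy training loop).
    model = nn.Linear(10, 2)                 # stand-in for the real learner
    model.load_state_dict(state_dict)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x, y = torch.randn(5, 10), torch.randint(0, 2, (5,))
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model.state_dict()

def average_state_dicts(state_dicts):
    # Average the weights of the n learners into a central learner.
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

ray.init()
central = nn.Linear(10, 2)
n = 2  # number of parallel learners, e.g. one per GPU (use @ray.remote(num_gpus=1) for GPU workers)
futures = [train_learner.remote(central.state_dict()) for _ in range(n)]
central.load_state_dict(average_state_dicts(ray.get(futures)))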

@AyanamiReiFan
Author

Thanks very much! @janbolle
It's really helpful!

@Kulbear

Kulbear commented Sep 3, 2020

@janbolle That's exciting work!
A few questions about your paper:

  1. It is implemented using Ray, so I suppose it works perfectly on CPU. Have you tried a GPU version of it?
  2. Do you have plans to release the implementation publicly?

Thank you!

@janbolle
Contributor

janbolle commented Sep 5, 2020

@Kulbear

  1. I did not use a GPU version, as the batch sizes and networks are relatively small and I suppose it would not speed things up much in this setting - but you could easily implement one, since Ray also supports GPUs.
  2. I only ran experiments for regression and classification. Also, the implementation is done in TF 2.0. Would it be helpful for you?

@Kulbear

Kulbear commented Sep 5, 2020

@janbolle Thanks for the reply!

  1. Got it, I can give it a try.
  2. This one is more out of personal curiosity :) If you're willing to share, that would be nice, but please feel free to do so (or not) at your convenience! I'm pretty new to TF 2.0; if I remember correctly, I switched to PyTorch around... well... maybe TF 1.9...

:D

@seba-1511
Member

Closing since dormant. Feel free to reopen.

@zhaozj89

zhaozj89 commented Nov 14, 2020

I have a large batch that cannot fit on a single 2080 Ti GPU (11 GB). I have tried:

learner = maml.clone()
learner = torch.nn.DataParallel(learner, device_ids=[0, 1])
# Training code

But all memory still goes to one GPU. Is there an easy way to get around this? Thanks.

@seba-1511
Member

@zhaozj89 This worked for me:

learner = maml.clone()
learner.module = torch.nn.DataParallel(learner.module, device_ids=[0, 1])  # wrap the inner module, not the MAML wrapper
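
For reference, a toy usage sketch of that trick (placeholder model and random data; whether the inner-loop gradients behave exactly as expected through DataParallel is worth verifying on your own setup):

import torch
import torch.nn as nn
import learn2learn as l2l

model = nn.Linear(10, 2).cuda()              # toy model
maml = l2l.algorithms.MAML(model, lr=0.5)
loss_fn = nn.CrossEntropyLoss()

learner = maml.clone()
# Wrap the inner module so forward passes are split across GPUs,
# while learner.adapt() stays available on the MAML wrapper.
learner.module = torch.nn.DataParallel(learner.module, device_ids=[0, 1])

x_s, y_s = torch.randn(8, 10).cuda(), torch.randint(0, 2, (8,)).cuda()
x_q, y_q = torch.randn(8, 10).cuda(), torch.randint(0, 2, (8,)).cuda()
learner.adapt(loss_fn(learner(x_s), y_s))    # inner-loop step, forward is data-parallel
meta_loss = loss_fn(learner(x_q), y_q)
meta_loss.backward()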

@brando90


See this thread: #197. It seems you can use a Lightning wrapper to parallelize MAML. I haven't tried it myself yet, but I assume it works. DDP seems tricky to get working for technical reasons I don't understand.

@SungFeng-Huang


Hi, here's my implementation of parallel MAML using learn2learn's LightningMAML + PyTorch Lightning DDP: https://gist.github.com/SungFeng-Huang/dec22eef5650f5a74d24a732ffd0080f
It should work when you add the argument "--meta_task_ddp".
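
To show the DDP mechanics on their own, here is a bare-bones sketch with a recent PyTorch Lightning version; the MetaLearner below is just a toy stand-in (plain supervised training) for the LightningMAML-based module defined in the gist:

import torch
import torch.nn as nn
import pytorch_lightning as pl

class MetaLearner(pl.LightningModule):
    # Hypothetical stand-in for the LightningMAML-based module from the gist above.
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=8,
)
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)
trainer.fit(MetaLearner(), train_loader)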
