Training Requirements #15
Comments
These are big models, so yes, they are memory intensive. We trained with 4 Quadro 6000 cards that have 24 GB of memory each. Alternatively, you can distribute across 8 GPUs with 12 GB of memory each. About 100 GB of GPU memory in total is roughly what is needed to train with our settings (number of datasets, resolution, and batch size). If you don't have access to that kind of infrastructure, you could use gradient accumulation on fewer cards.

The second issue has to do with how we hook into timm: we monkey patch an additional method onto the object so that we don't need to modify the original library source. See here for an example: https://github.com/intel-isl/DPT/blob/72830e11c7e72f58aee1465ab5207e3d0f0ab9fd/dpt/vit.py#L343

Unfortunately, this strategy doesn't play well with nn.DataParallel. AFAIK this is an open issue with DataParallel; see here for a similar discussion: Cadene/pretrained-models.pytorch#112. I recommend either switching to DistributedDataParallel, where this issue doesn't occur, or rewriting the backbone so that this monkey patching isn't required.
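The gradient accumulation suggested above can be sketched as follows. This is a generic PyTorch pattern, not code from this repo; the tiny model, batch size, and `accum_steps` value are placeholders chosen for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; in practice this would be the DPT model.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accum_steps = 4  # effective batch size = per-step batch size * accum_steps

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(2, 16)           # small per-step batch that fits in memory
    y = torch.randn(2, 1)
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # update once per accumulation window
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` before `backward()` keeps the accumulated gradient equivalent to the mean over one large batch, so the learning rate does not need to be rescaled.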
Thanks, @ranftlr, for the detailed response. Yes, I later found out the reason for that. I also noticed that the register_forward_hook operation doesn't go well with nn.DataParallel, as mentioned here: register-forward-hook-with-multiple-gpus. I worked around it by using nn.ModuleList to store each of the blocks and then taking the features from the corresponding block in the forward operation. This bypasses the register_forward_hook function entirely, and I can run the model in parallel now.
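The workaround described above can be sketched like this. This is a minimal illustration rather than the actual DPT backbone: the `Backbone` class, block type, sizes, and hook indices are all hypothetical; only the pattern (blocks in an nn.ModuleList, features collected in-line in forward instead of via hooks) matches the description:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Toy backbone that collects intermediate features without forward hooks."""

    def __init__(self, dim=64, depth=4, hooks=(1, 3)):
        super().__init__()
        # Blocks live in an nn.ModuleList so they are registered as submodules
        # and replicated correctly by DataParallel/DistributedDataParallel.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.hooks = set(hooks)  # indices of blocks whose outputs we keep

    def forward(self, x):
        features = []
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.hooks:
                features.append(x)  # grab the intermediate output in-line
        return features

model = Backbone()
feats = model(torch.randn(2, 10, 64))  # (batch, tokens, dim)
```

Because the features are returned from forward rather than stashed in a hook, each DataParallel replica handles its own shard and the results are gathered normally.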
Great, thanks for pointing out this solution. I'm closing this issue for now.
@krantiparida Hi, I have run into the same issue. Could you please share the rewritten version of the DPT model?
@Tord-Zhang I have modified the code as per my requirements, so I am not sure if it will be useful to you. However, I am attaching the DPT model part below. The other functions used are similar to the ones in this repo.
@krantiparida Hi, as far as I can tell, this code cannot solve the problem caused by self.patch_embed, can it?
@Tord-Zhang Yes, this will not address the problem with self.patch_embed. In my case I did not require self.patch_embed, but I think you can implement it using nn.ModuleList as well.
Add DDP multi-GPU training code
Hi, I was trying to use your code for training on another dataset for the depth prediction task. I noticed that during training I could not increase the batch size beyond 2; with a batch size of 2 and images of size 224x448 it takes almost 9 GB of memory. Can you comment on the memory requirements? How did you train the model, and how much memory did it take? It would be really helpful if you could share some insights on training.
Thanks