Conserve gpu memory by storing history on cpu memory instead #3
Conversation
This patch offloads the AADL history to CPU memory instead of using valuable GPU memory. This incurs a performance hit from transferring the vectors to and from CPU memory, but it allows training with a smaller reduction in batch size than would otherwise be needed to avoid running out of memory. This could probably be ameliorated by interleaving the memory transfers with the computation.

This change also fixes a bug in `torch.nn.utils.convert_parameters.vector_to_parameters`, which does not preserve the `memory_format` of `param.data`.

The history device is configurable, so users who prefer to keep the history in GPU memory for some reason can still do so via `accelerate(..., history_device="cuda")`.

For reference, without CPU memory offload I get errors like the following after about 90 iterations; with CPU memory offload I am able to go 300+ iterations at the same batch size:

```
RuntimeError: CUDA out of memory. Tried to allocate 2.55 GiB (GPU 0; 24.00 GiB total capacity; 16.60 GiB already allocated; 1.82 GiB free; 19.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
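A minimal sketch of the offload idea (the helper names here are made up for illustration and are not AADL's actual API):

```python
import torch

history_device = torch.device("cpu")   # pass "cuda" to keep the old all-GPU behavior
compute_device = torch.device("cuda")

history = []

def push_history(flat_params: torch.Tensor) -> None:
    # Copy the flattened parameter vector off the GPU. Pinned memory plus
    # non_blocking=True copies would be one way to interleave these
    # transfers with computation.
    history.append(flat_params.detach().to(history_device))

def gather_history() -> torch.Tensor:
    # Stack the stored vectors and move them back to the GPU only for
    # the duration of the least-squares solve.
    return torch.stack(history).to(compute_device)
```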
Hi @allaffa, Thanks for merging this in, but this pull request included the change to line 19; you may want to revert that change specifically, since you mentioned in #2 (comment) that the original line was correct. But as I mentioned in my reply in that issue, if you revert my change to line 19 back to the original, the code will error out.
Hi @henrymai
The matrix
So the right hand side must be
This typo was not in the GitLab repository; it only showed up in the first ORNL/GitHub release, after a few changes made to clarify the code. Thank you for pointing this out.
Ok, I'll submit a new pull request to change the right hand side accordingly.
Besides the improvement in time, did you also notice any changes in the convergence of the training?

If the contiguous memory format is NOT enforced as you proposed in the PR, the least-squares problem solved to compute the mixing coefficients for Anderson should still be correct, because the switch between contiguous and non-contiguous introduces a permutation on the rows of the matrix DR and the right-hand side DX[-1]. However, a permutation of the rows does not change the solution to the least-squares problem.

My colleague @jqyin is running the new version of the code on Summit and cannot reproduce the same results obtained with the old version of the code.
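A quick self-contained check of the permutation claim (the shapes are stand-ins; only the DR and DX[-1] roles come from the comment above):

```python
import torch

torch.manual_seed(0)
DR = torch.randn(100, 5)    # stand-in for the residual-difference matrix
rhs = torch.randn(100, 1)   # stand-in for the right-hand side DX[-1]

# Apply the same row permutation to both sides; the least-squares
# solution (the Anderson mixing coefficients) is unchanged.
perm = torch.randperm(DR.shape[0])
gamma = torch.linalg.lstsq(DR, rhs).solution
gamma_permuted = torch.linalg.lstsq(DR[perm], rhs[perm]).solution
print(torch.allclose(gamma, gamma_permuted, atol=1e-5))  # True
```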
This patch was not for an improvement in time. As for noticing a convergence difference, I don't have a baseline, since the original would eat up too much GPU memory for me to train for long enough (and it would also throw an exception when it hit the typo), but as I mention below, there shouldn't be a functional difference.
Right, contiguous or not, it shouldn't change the results; it's just that I get:
That is strange; it should be functionally equivalent to the original. Could there be some other changes that might not have made it over from the GitLab repo to the GitHub repo? This is referring to your comment earlier:
If possible, can your colleague try reverting my changes (and only fixing the original typo) to see if they can reproduce the results using the original GitHub code?
Actually, my colleague @jqyin is noticing that offloading to CPU memory and reloading to GPU every time, instead of keeping everything stored on the GPU, reduced the time for training ResNet50 on ImageNet1k by a factor of 3.
Can you give me some indication of what type of neural network architecture you are using, and what type of data you train the neural network on?
We are going to double-check between the two versions of the code and keep you informed about possible differences. If we find any, we will open a PR to introduce the missing pieces. Thanks.
Wow, that is very interesting. Out of curiosity, I have a bunch of questions about Summit. Is the latency on the NVLink bus between the POWER9 processors and the GPUs fairly low compared to PCIe? Did @jqyin increase the batch size, given the extra usable GPU memory, to increase the training speed? I'm also wondering whether Summit has true unified memory between the CPUs and GPUs, where no movement actually happens, or if it is just "fake" unified memory where the transfers are managed behind the scenes by CUDA. I also have a hunch that eventually, in addition to
My model consists mostly of convolution layers and my inputs are 3x128x128 images. EDIT:
Great, I will be very interested to hear the results.
Are you guys using fp16 weights on the GPU?
No. We are using fp32 single-precision floating point.
Just as an additional sanity check, have you guys also tested setting both history_device and compute_device to GPU?
Neither of the following two situations works:
The only situation that works is the one we were already using before, with both history_device and compute_device on the CPU.
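For concreteness, a hypothetical usage sketch of those configurations (the import path and optimizer argument are assumptions; only the history_device/compute_device keywords come from this thread):

```python
import torch
import AADL  # assumed import path

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The only combination that reproduces the old results in these runs:
AADL.accelerate(optimizer, history_device="cpu", compute_device="cpu")

# Equivalent to the pre-patch, all-GPU behavior:
# AADL.accelerate(optimizer, history_device="cuda", compute_device="cuda")
```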
Just to clarify, you mean before as in this comment: #3 (comment), right? Otherwise, if you meant before as in prior to my patch, that is actually the same as history_device = GPU and compute_device = GPU (assuming that you guys were training your models on the GPU before). Can you guys test locally reverting my patch?
When training ResNet50 on ImageNet1k, the older version of the code could run only on CPUs, because on the GPU it would run out of memory.
Oh ok, that makes sense now.
Ok, great, this confirms that my patch is indeed equivalent to before (just with added flexibility).
I think this means that the linear algebra routines being computed on the GPU are returning results different from what the CPU returns.

For now, I think both history_device and compute_device should be set to cpu (I can open a pull request for this, or you guys can do it). Then we should investigate the differences in the results of the linear algebra routines between the CPU and GPU (likely something we will need PyTorch to fix, assuming you guys are already on the latest version of PyTorch).
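A minimal probe of that suspicion (assuming a CUDA device is available): solve the same least-squares problem on both devices and compare; any gap comes from the different backends (LAPACK on CPU vs cuSOLVER/cuBLAS on GPU).

```python
import torch

torch.manual_seed(0)
A = torch.randn(1000, 10)
b = torch.randn(1000, 1)

x_cpu = torch.linalg.lstsq(A, b).solution
x_gpu = torch.linalg.lstsq(A.cuda(), b.cuda()).solution.cpu()

# A nonzero, device-dependent difference indicates the routines disagree.
print((x_cpu - x_gpu).abs().max())
```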
Here's the pull request to default compute_device to cpu for now: https://github.com/ORNL/AADL/pull/6/files
I think the PyTorch version that @jqyin is using on Summit is not the latest one. The machine is very complex, and the software stack is not very stable, so most users need to install their own Python environment.
Yeah, there have been a lot of fixes to PyTorch's linear algebra routines over time, so that might be the reason for the differences.
I can also try the same thing when I have time, but I will be doing it from the latest NVIDIA NGC PyTorch container.
For the record, I'm using PyTorch v1.9.0 on Summit for those experiments.
Ok, that is actually fairly recent; I don't think 1.10 would have changed anything in regard to this.
I tried to make a minimal self-contained example (MNIST) to reproduce the differences that you're seeing between AADL GPU vs AADL CPU. I am able to see that there is a loss/accuracy difference between AADL GPU and AADL CPU, but both of them still converge to 80%+ test accuracy. Meaning, the computations are definitely producing different results on CPU vs GPU, but in my example they are not causing the GPU version to fail to converge.

Git repo for the example: https://github.com/henrymai/aadl_testing

To rule out hardware differences causing problems, can you try running my small example on your hardware to see if you get results similar to what I published in the git repo?
@henrymai thanks for the test. I ran your test on Summit and got similar results:

```
jsrun -n1 -a1 -g1 python main.py --AADL_gpu
jsrun -n1 -a1 -g1 python main.py --AADL_cpu
jsrun -n1 -a1 -g1 python main.py
```

I guess for a complicated problem such as ImageNet training and/or distributed training, the differences are magnified.
Do you guys at ORNL happen to have a support contract/relationship with NVIDIA? I think this problem probably comes down to their cuBLAS/cuSolver kernels that PyTorch is calling into. They might also have suggestions on how to mitigate this problem.
I wonder if the problem is general to all of PyTorch or just to
This is possible by setting
I actually almost immediately get an out of memory error when trying
Same even when using
This line of code is exactly the same as the one used to update the solution with lstsq. I don't understand why the out-of-memory is happening on this line.
@allaffa *Although, I'm not sure if the math remains correct.
@allaffa
Screenshot of TensorBoard: https://i.imgur.com/RK2AMBp.png
@henrymai
@allaffa I'll paste them here too for easy reference:
I'm asking @jqyin to double-check how the normal equation behaves with ResNet50 trained on ImageNet1k.
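For reference, a sketch of the normal-equations formulation for a generic overdetermined system Ax = b (solving A^T A x = A^T b); note it squares the condition number of A, so it can behave differently from the QR-based lstsq on an ill-conditioned history matrix:

```python
import torch

def solve_normal_equations(A: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Solve (A^T A) x = A^T b directly instead of calling lstsq.
    return torch.linalg.solve(A.T @ A, A.T @ b)

torch.manual_seed(0)
A = torch.randn(200, 5)
b = torch.randn(200, 1)
x_ne = solve_normal_equations(A, b)
x_ls = torch.linalg.lstsq(A, b).solution
print(torch.allclose(x_ne, x_ls, atol=1e-4))  # close for well-conditioned A
```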
Also, from the docs, the least-squares function torch.linalg.lstsq actually uses different implementations on CPU and GPU: QR with pivoting vs plain QR. Maybe that causes the differences? Unfortunately, there's only one implementation on GPU, so we can't easily switch.
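One CPU-only way to probe this (assuming the driver keyword from PyTorch >= 1.9): 'gelsy' (QR with pivoting) is the CPU default, while 'gels' (plain QR) is the only driver CUDA supports, so comparing the two on the CPU isolates the algorithmic difference from the hardware difference.

```python
import torch

torch.manual_seed(0)
A = torch.randn(100, 5)
b = torch.randn(100, 1)

x_gelsy = torch.linalg.lstsq(A, b, driver="gelsy").solution  # CPU default
x_gels = torch.linalg.lstsq(A, b, driver="gels").solution    # what CUDA uses
print((x_gelsy - x_gels).abs().max())
```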
@jqyin Also see this tracking issue, where they are looking to migrate more things over to using cuBLAS/cuSolver:

Also, as I mentioned earlier, I would suggest reaching out to NVIDIA developer support (if you guys have a contract/relationship with them) to see if they can investigate and provide solutions/mitigations. NVIDIA actively contributes to PyTorch, so they will probably be very helpful if you can reach out to them.