Issues with KFAC on multiple GPUs #26
Hi there. Thanks again for the awesome open-sourcing work of the KFAC optimizer!

However, as I mentioned in #24 (comment) several weeks ago, we hit some issues when running optimization with KFAC. As suggested by @jsspencer, it may be an issue with a noisy matrix being inverted. However, we tried a quite large batch size (40960 for Mg) and a large damping factor (1, as opposed to the default 0.001), and neither fixed the issue.

Recently, we noticed that the cuSolver issue only shows up when we optimize FermiNet with multiple GPUs (in our case 8 V100 cards). If we run the same command for Mg on a single GPU (even with a small batch size like 256), it does not hit the same cuSolver issue.

Unfortunately, we failed to spot where the cuSolver issue really happens. We tried to debug with the KFAC optimizer's `debug` option turned on (so that no `jit` or `pmap` happens), but the optimizer doesn't work at all with `debug` on, and it seems non-trivial to fix (maybe we didn't try hard enough). We think the problematic inversion might happen at https://github.com/deepmind/deepmind-research/blob/master/kfac_ferminet_alpha/utils.py#L131, but even if we simply replace the matrix to be inverted with an identity matrix, the issue persists in the multi-GPU environment.

Since the optimization works in a single-GPU environment but fails in a multi-GPU one, we suspect something goes wrong when `pmap` meets cuSolver, but we don't know how to dig deeper. Thoughts?

BTW, do you do all your development and testing on TPU instead of GPU? If so, we might also run our experiments on TPUs if KFAC works there. Any gotchas when running JAX on TPU? Thanks!
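For reference, a minimal sketch (not from the original report) of the kind of multi-device identity-inversion test described above; `solve_identity` is a hypothetical stand-in for the inversion in `kfac_ferminet_alpha/utils.py`, and multiple visible GPUs are assumed:

```python
# Hypothetical repro: run a cuSolver-backed solve on every device via pmap,
# inverting a well-conditioned identity matrix as in the report above.
import jax
import jax.numpy as jnp

n_devices = jax.local_device_count()  # e.g. 8 for 8 V100 cards

@jax.pmap
def solve_identity(key):
    a = jnp.eye(128)                        # identity: trivially invertible
    b = jax.random.normal(key, (128, 128))
    # jnp.linalg.solve lowers to an LU factorization, which cuSolver performs
    # on GPU; if this fails only when n_devices > 1, the data itself cannot
    # be the problem.
    return jnp.linalg.solve(a, b)

keys = jax.random.split(jax.random.PRNGKey(0), n_devices)
print(solve_identity(keys).shape)  # (n_devices, 128, 128)
```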
I don't have much to say about the cuSolver issue at the moment, but I would recommend against using FermiNet on TPU for now. It is significantly slower than on GPU, largely due to issues with matrix inversion and LU decomposition, neither of which TPUs are really designed for (they are mostly good for dense matrix multiplication and elementwise operations, and not much else). So unless you want to spend some time optimizing the XLA:TPU implementation of lu_solve, I would steer clear of TPUs right now.
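To make that gap concrete, here is a rough, hypothetical micro-benchmark (not from the thread) comparing a dense matmul against an LU-based solve of the same size on whatever backend JAX is running:

```python
# Hypothetical timing sketch: dense matmul vs. LU factor-and-solve.
import time
import jax
import jax.scipy.linalg as jsl

n = 1024
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (n, n))
b = jax.random.normal(key, (n,))

matmul = jax.jit(lambda x: x @ x)
lu_solve = jax.jit(lambda a, b: jsl.lu_solve(jsl.lu_factor(a), b))

# Run once to compile, so we time execution rather than tracing.
matmul(a).block_until_ready()
lu_solve(a, b).block_until_ready()

t0 = time.perf_counter()
matmul(a).block_until_ready()
t1 = time.perf_counter()
lu_solve(a, b).block_until_ready()
t2 = time.perf_counter()
print(f"matmul: {t1 - t0:.4f}s  lu_solve: {t2 - t1:.4f}s")
```

On GPU the two should be reasonably close at this size; the claim above is that on TPU the LU-based solve falls much further behind the matmul.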
Wow, thanks a lot for such a prompt response! Got it, no TPU for now. So you did development and testing on GPUs as well? Just to confirm, you didn't hit similar issues when optimizing FermiNet on atoms like Na and Mg, right? Which CUDA version were you using?
We were able to optimize all second row atoms using `pmap` across 8 GPUs,
no matter what CUDA version we were using. I think we were using CUDA 10
but I'm not sure.
Right, we can also successfully train FermiNet with KFAC for atoms like C, O, and F on 8 GPUs. However, we hit the cuSolver issue when we move to Na and Mg; we're not sure why. We might also give CUDA 10 a try (we were using CUDA 11).
Na and Mg are second row.
Sorry, I thought the second row was Li to Ne. So you mean you have done experiments with elements from Na to Ar?
Yes. Results for P to Ar are in our NeurIPS 2020 workshop paper (https://arxiv.org/abs/2011.07125). We have some calculations using CUDA 11.3, but most use 10.3 (including, I believe, all those in the workshop paper).
@jsspencer Got it, thanks for the info! I ran some experiments with CUDA 10.3 today and they all ran smoothly, including the previously failed ones! I suspect there's some breaking change introduced in CUDA 11, but I'm not sure how to dig further. I will close this ticket for now. (Also, thanks @dpfau for all the help!) Thanks again for the awesome work!
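As a closing aside (not from the thread), when chasing this kind of CUDA-version-dependent failure it can help to confirm exactly which jax/jaxlib build and backend is actually in use before and after switching toolkits. A minimal check, assuming a pip-installed jaxlib:

```python
# Hypothetical environment sanity check before comparing CUDA versions.
import jax
import jaxlib.version

print("jax:", jax.__version__)
print("jaxlib:", jaxlib.version.__version__)  # the CUDA-specific wheel version
print("backend:", jax.default_backend())      # expect 'gpu'
print("devices:", jax.devices())              # expect all 8 GPUs listed
# Pair with `nvidia-smi` and `nvcc --version` for the driver and toolkit.
```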