Issues with KFAC on multiple GPUs #26
Hi there. Thanks again for the awesome open-sourcing work of the KFAC optimizer!

However, as I mentioned in #24 (comment) several weeks ago, we hit some issues when running optimization with KFAC. As suggested by @jsspencer, it may be an issue with a noisy matrix being inverted. However, we tried a quite large batch size (40960 for Mg) and a large damping factor (1, as opposed to the default 0.001), and neither fixed the issue.

Recently, we noticed that the cuSolver issue only shows up when we optimize FermiNet with multiple GPUs (in our case 8 V100 cards). If we run the same command for Mg on a single GPU (even with a small batch size like 256), it does not hit the same cuSolver issue.

Unfortunately, we failed to spot where the cuSolver issue really happens. We tried to debug with the KFAC optimizer's `debug` option turned on (so that no `jit` or `pmap` happens), but the optimizer doesn't work at all with `debug` on, and it seems non-trivial to fix (maybe we didn't try hard enough). We think the problematic inversion might happen at https://github.com/deepmind/deepmind-research/blob/master/kfac_ferminet_alpha/utils.py#L131, but even if we simply replace the matrix to be inverted with an identity matrix, the issue persists in the multi-GPU environment.

Since the optimization works in a single-GPU environment but fails in a multi-GPU one, we suspect something goes wrong when `pmap` meets cuSolver, but we don't know how to dig deeper. Thoughts?

BTW, do you do all your development and testing on TPU instead of GPU? If so, we might also run our experiments on TPUs if KFAC works there. Any gotchas when running JAX on TPU? Thanks!
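For reference, a minimal sketch (not from the original report) of the kind of multi-device identity-inversion test described above; `solve_identity` is a hypothetical stand-in for the inversion in `kfac_ferminet_alpha/utils.py`, and multiple visible GPUs are assumed:

```python
# Hypothetical repro: run a cuSolver-backed solve on every device via pmap,
# inverting a well-conditioned identity matrix as in the report above.
import jax
import jax.numpy as jnp

n_devices = jax.local_device_count()  # e.g. 8 for 8 V100 cards

@jax.pmap
def solve_identity(key):
    a = jnp.eye(128)                        # identity: trivially invertible
    b = jax.random.normal(key, (128, 128))
    # jnp.linalg.solve lowers to an LU factorization, which cuSolver performs
    # on GPU; if this fails only when n_devices > 1, the data itself cannot
    # be the problem.
    return jnp.linalg.solve(a, b)

keys = jax.random.split(jax.random.PRNGKey(0), n_devices)
print(solve_identity(keys).shape)  # (n_devices, 128, 128)
```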
I don't have much to say about the cuSolver issue at the moment, but I would recommend against using FermiNet on TPU for now. It is significantly slower than on GPU, largely due to issues with matrix inversion and LU decomposition, neither of which TPUs are really designed for (they are mostly good for dense matrix multiplication and elementwise operations, and not much else). So unless you want to spend some time optimizing the XLA:TPU implementation of lu_solve, I would steer clear of TPUs right now.
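To make that gap concrete, here is a rough, hypothetical micro-benchmark (not from the thread) comparing a dense matmul against an LU-based solve of the same size on whatever backend JAX is running:

```python
# Hypothetical timing sketch: dense matmul vs. LU factor-and-solve.
import time
import jax
import jax.scipy.linalg as jsl

n = 1024
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (n, n))
b = jax.random.normal(key, (n,))

matmul = jax.jit(lambda x: x @ x)
lu_solve = jax.jit(lambda a, b: jsl.lu_solve(jsl.lu_factor(a), b))

# Run once to compile, so we time execution rather than tracing.
matmul(a).block_until_ready()
lu_solve(a, b).block_until_ready()

t0 = time.perf_counter()
matmul(a).block_until_ready()
t1 = time.perf_counter()
lu_solve(a, b).block_until_ready()
t2 = time.perf_counter()
print(f"matmul: {t1 - t0:.4f}s  lu_solve: {t2 - t1:.4f}s")
```

On GPU the two should be reasonably close at this size; the claim above is that on TPU the LU-based solve falls much further behind the matmul.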
Wow, thanks a lot for such a prompt response! Got it, no TPU for now. So you did development and testing on GPUs as well? Just to confirm, you didn't hit similar issues when optimizing FermiNet on atoms like Na and Mg, right? Which CUDA version were you using?
We were able to optimize all second row atoms using `pmap` across 8 GPUs,
no matter what CUDA version we were using. I think we were using CUDA 10
but I'm not sure.
Right, we can also successfully train FermiNet with KFAC for atoms like C, O, and F on 8 GPUs. However, we hit the cuSolver issue when we move to Na and Mg; we're not sure why. We might also give CUDA 10 a try (we were using CUDA 11).
Na and Mg are second row.
Sorry, I thought the second row was Li to Ne. So you mean you have done experiments with elements from Na to Ar?
Yes. Results for P to Ar are in our NeurIPS 2020 workshop paper (https://arxiv.org/abs/2011.07125). We have some calculations using CUDA 11.3, but most use 10.3 (including, I believe, all those in the workshop paper).
@jsspencer Got it, thanks for the info! I ran some experiments with CUDA 10.3 today and they all ran smoothly, including the previously failed ones! I suspect there's some breaking change introduced in CUDA 11, but I'm not sure how to dig further. I will close this ticket for now. (Also, thanks @dpfau for all the help!) Thanks again for the awesome work!
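As a closing aside (not from the thread), when chasing this kind of CUDA-version-dependent failure it can help to confirm exactly which jax/jaxlib build and backend is actually in use before and after switching toolkits. A minimal check, assuming a pip-installed jaxlib:

```python
# Hypothetical environment sanity check before comparing CUDA versions.
import jax
import jaxlib.version

print("jax:", jax.__version__)
print("jaxlib:", jaxlib.version.__version__)  # the CUDA-specific wheel version
print("backend:", jax.default_backend())      # expect 'gpu'
print("devices:", jax.devices())              # expect all 8 GPUs listed
# Pair with `nvidia-smi` and `nvcc --version` for the driver and toolkit.
```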