
Issues with KFAC on multiple GPUs #26

Closed
connection-on-fiber-bundles opened this issue May 10, 2021 · 8 comments

@connection-on-fiber-bundles

Hi there. Thanks again for the awesome work open-sourcing the KFAC optimizer!

However, as I mentioned in #24 (comment) several weeks ago, we hit some issues when running optimization with KFAC.

As suggested by @jsspencer, the problem may be caused by a noisy matrix being inverted. However, we tried a quite large batch size (40960 for Mg) and a large damping factor (1 as opposed to the default 0.001), and neither fixed the issue.
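
For concreteness, the overrides described above look roughly like the sketch below. This is only a sketch: the field names (cfg.batch_size, cfg.optim.kfac.damping) follow the public FermiNet config layout as we understand it and may differ in other checkouts.

```python
# Sketch of the config overrides mentioned above. Field names follow the
# public FermiNet config (base_config.default()); treat them as assumptions
# and adjust to the version you are running.
from ferminet import base_config

cfg = base_config.default()
cfg.batch_size = 40960          # much larger batch than the default
cfg.optim.kfac.damping = 1.0    # vs. the default damping of 0.001
```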

Recently, we noticed that the cusolver issue only shows up when we optimize FermiNet with multiple GPUs (in our case 8 V100 cards). If we run the same command for Mg on a single GPU (even with a small batch size like 256), it does not hit the same cusolver issue.

Unfortunately, we failed to spot where the cusolver issue actually happens. We tried to debug with the KFAC optimizer's debug option turned on (so that no jit or pmap happens), but the KFAC optimizer doesn't work at all with debug enabled, and fixing that seems non-trivial (maybe we didn't try hard enough). We think the problematic inversion might happen at https://github.com/deepmind/deepmind-research/blob/master/kfac_ferminet_alpha/utils.py#L131, but even if we simply replace the matrix to be inverted with an identity matrix, the issue persists in the multi-GPU environment.

Since the optimization works in a single-GPU environment but fails in a multi-GPU one, we suspect something goes wrong when pmap meets cusolver, but we don't know how to dig deeper. Thoughts?
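
One way to probe that suspicion in isolation is a minimal sketch like the following. It assumes nothing about FermiNet or KFAC, only that the solve is cusolver-backed; the matrix size and damping value are arbitrary choices.

```python
# Minimal repro sketch for the suspected pmap + cusolver interaction.
# It inverts a well-conditioned PSD matrix on every local GPU via pmap,
# which should exercise the cusolver-backed solve path without any
# FermiNet/KFAC code. Matrix size and damping are arbitrary.
import jax
import jax.numpy as jnp

DIM = 512
DAMPING = 1e-3

def damped_psd_inverse(key):
  a = jax.random.normal(key, (DIM, DIM))
  m = a @ a.T + DAMPING * jnp.eye(DIM)   # symmetric PSD + diagonal shift
  return jnp.linalg.solve(m, jnp.eye(DIM))

keys = jax.random.split(jax.random.PRNGKey(0), jax.local_device_count())
inv = jax.pmap(damped_psd_inverse)(keys)
print("max |inv| per device:", jnp.max(jnp.abs(inv), axis=(1, 2)))
```

If this only fails when more than one device is visible, that would point at the pmap/cusolver interaction rather than at the KFAC factors themselves.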

BTW, do you do all the development and testing on TPU instead of GPU? If so, we might also run our experiments on TPUs if KFAC works there. Any gotchas when running JAX on TPU? Thanks!

@dpfau
Collaborator

dpfau commented May 10, 2021 via email

@connection-on-fiber-bundles
Author

Wow, thanks a lot for such a prompt response!

Got it, no TPU for now. So you did development and testing on GPUs as well? Just to confirm, you didn't hit similar issues when optimizing FermiNet on atoms like Na and Mg, right? Which CUDA version were you using?

@dpfau
Collaborator

dpfau commented May 10, 2021 via email

@connection-on-fiber-bundles
Author

Right, we can also successfully train FermiNet with KFAC for atoms like C, O and F with 8 GPUs. However, we hit the cusolver issue when we move to Na and Mg; we're not sure why. We might also give CUDA 10 a try (we were using CUDA 11).

@dpfau
Collaborator

dpfau commented May 10, 2021 via email

Na and Mg are second row.

@connection-on-fiber-bundles
Author

Sorry, I thought the second row was Li to Ne. So you mean you have done experiments with elements from Na to Ar?

@jsspencer
Collaborator

Yes. Results for P to Ar are in our NeurIPS 2020 workshop paper (https://arxiv.org/abs/2011.07125). We have some calculations using CUDA 11.3 but most use 10.3 (including, I believe, all those in the workshop paper).

@connection-on-fiber-bundles
Author

@jsspencer Got it, thanks for the info!

I ran some experiments with CUDA 10.3 today and they all ran smoothly, including the previously failing ones! I suspect there's some breaking change introduced in CUDA 11, but I'm not sure how to dig further. I will close this ticket for now. (Thanks also to @dpfau for all the help!)
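
For reference, a quick way to record which stack a given run actually used when comparing the two environments (a generic JAX snippet, nothing FermiNet-specific):

```python
# Record the jax/jaxlib versions and visible accelerators for a run,
# to keep the CUDA 10.x vs CUDA 11.x comparisons straight.
import jax
import jaxlib

print("jax", jax.__version__, "jaxlib", jaxlib.__version__)
print(jax.devices())   # should list all 8 GPUs in the multi-GPU setup
```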

Thanks again for the awesome work!
