
Performance of the atomic neural networks in TorchANI #11

Open · raimis opened this issue Oct 13, 2020 · 5 comments
Labels: enhancement (New feature or request)

Comments

raimis (Contributor) commented Oct 13, 2020

End-to-end performance benchmarks of ANI-2x

Molecule: 46 atoms (pytorch/molecules/2iuz_ligand.mol2)
GPU: GTX 1080 Ti

| Benchmark | TorchANI with original featurizer | TorchANI with our featurizer |
| --- | --- | --- |
| Forward & backward, complete ANI-2x | 90 ms | 81 ms |
| Forward only, complete ANI-2x | 25 ms | 23 ms |
| Forward & backward, 1 of the 8 sets of atomic NNs | 11 ms | 6.8 ms |
| Forward only, 1 of the 8 sets of atomic NNs | 6.3 ms | 3.7 ms |

Originally posted by @raimis in #5 (comment)

raimis mentioned this issue Oct 13, 2020
raimis (Contributor, Author) commented Oct 13, 2020

Replying to #5 (comment):

> Looks like the neural net part is now the bottleneck. From the benchmarks in #6, doing both forward and backward passes through the features takes only 0.115 ms for a system of 60 atoms, and 1.04 ms for a system of 2269 atoms.
>
> Do you have a sense of what makes the neural net part slow? Can we make it faster from within PyTorch, or do we need a custom kernel for that part too?
>
> Also, in the above numbers, how much of the time is spent constructing and destructing CudaANISymmetryFunction objects, and how much is spent in the kernels?

The implementation of the atomic NNs isn't optimal in TorchANI. For example, ANI-2x has 8 sets of atomic NNs, each set has 7 atomic NNs (one per element), and each atomic NN is a 3-layer fully-connected network. The networks are computed sequentially, so a matrix-multiplication kernel is executed 168 times (= 8 × 7 × 3) just in the forward pass. Using batched matrix multiplication, it should be possible to reduce that to 3 kernel executions (see the sketch below).
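To make the idea concrete, here is a minimal sketch of the batched evaluation. This is not TorchANI's actual code: the layer sizes are hypothetical, biases are omitted, and it assumes all the atomic NNs share the same shapes, whereas the real per-element networks have different hidden widths and would need padding or grouping by shape.

```python
import torch

# Hypothetical sizes, for illustration only
n_nets = 8 * 7          # 8 ensemble members x 7 per-element networks
n_atoms = 46            # atoms routed to each network (padded for simplicity)
feat, hidden, out = 1008, 160, 1

# Stack the weights of all 56 networks into one tensor per layer,
# instead of keeping 56 separate Linear modules
w1 = torch.randn(n_nets, feat, hidden, device="cuda")
w2 = torch.randn(n_nets, hidden, hidden, device="cuda")
w3 = torch.randn(n_nets, hidden, out, device="cuda")

x = torch.randn(n_nets, n_atoms, feat, device="cuda")  # batched features

# 3 batched-matmul kernels in the forward pass instead of 168 sequential ones
h = torch.celu(torch.bmm(x, w1))
h = torch.celu(torch.bmm(h, w2))
atomic_energies = torch.bmm(h, w3)
```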

After finishing #5, I'll try to make a batched PyTorch implementation. Ultimately, TensorRT should be very good at that.

raimis (Contributor, Author) commented Oct 14, 2020

@peastman, just letting you know before you start writing the NN part directly in CUDA.

I almost have a working implementation of the NN part using batched matrix multiplications. I still have to fix a bug or two, but I can already see a significant performance gain. I'll share the benchmarks soon.

peastman (Member) commented
Thanks! Looking forward to seeing it.

raimis mentioned this issue Oct 15, 2020
isayev commented Oct 26, 2020

Dear @raimis, these results look amazing! Just a note: the 1x/2x hyperparameter optimization was done only with respect to accuracy. We would very much welcome performance considerations and other constraints for the next iteration. Even the current models could be re-trained and re-fitted if necessary.

raimis (Contributor, Author) commented Oct 28, 2020

@isayev In the case of ANI-2x, for small molecules (~100 atoms) the bottleneck is the matrix multiplications in the dense layers, so a single-model NNP (rather than the 8-member ensemble) would improve speed (see the sketch below). For bigger molecules, the bottleneck becomes the neighbour search for the symmetry functions.
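For reference, a minimal sketch of running a single ensemble member instead of the full ensemble, assuming TorchANI's model-indexing API (to the best of my knowledge, indexing the built-in model returns one member in TorchANI 2.x); the water geometry is just an illustrative input:

```python
import torch
import torchani

device = torch.device("cuda")

# Full 8-member ANI-2x ensemble
ensemble = torchani.models.ANI2x(periodic_table_index=True).to(device)

# A single ensemble member: roughly 8x less matmul work in the NN part,
# at the cost of losing the ensemble averaging
single = ensemble[0]

# Hypothetical input: atomic numbers and coordinates for one molecule (H2O)
species = torch.tensor([[8, 1, 1]], device=device)
coordinates = torch.tensor([[[0.00, 0.00, 0.00],
                             [0.00, 0.00, 0.96],
                             [0.93, 0.00, -0.24]]],
                           device=device, requires_grad=True)

energy = single((species, coordinates)).energies
forces = -torch.autograd.grad(energy.sum(), coordinates)[0]
```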

raimis added the enhancement label on May 24, 2022