
Performance of the atomic neural networks in TorchANI #11

Open · raimis opened this issue Oct 13, 2020 · 5 comments
Labels: enhancement (New feature or request)

Comments

raimis (Contributor) commented Oct 13, 2020

End-to-end performance benchmarks of ANI-2x

Molecule: 46 atoms (pytorch/molecules/2iuz_ligand.mol2)
GPU: GTX 1080 Ti

| Benchmark | TorchANI with original featurizer | TorchANI with our featurizer |
| --- | --- | --- |
| Forward & backward, complete ANI-2x | 90 ms | 81 ms |
| Forward only, complete ANI-2x | 25 ms | 23 ms |
| Forward & backward, 1 of the 8 sets of atomic NNs | 11 ms | 6.8 ms |
| Forward only, 1 of the 8 sets of atomic NNs | 6.3 ms | 3.7 ms |

Originally posted by @raimis in #5 (comment)

raimis mentioned this issue Oct 13, 2020
raimis (Contributor, Author) commented Oct 13, 2020

Replying to #5 (comment):

> Looks like the neural net part is now the bottleneck. From the benchmarks in #6, doing both forward and backward passes through the features takes only 0.115 ms for a system of 60 atoms, and 1.04 ms for a system of 2269 atoms.
>
> Do you have a sense of what makes the neural net part slow? Can we make it faster from within PyTorch, or do we need a custom kernel for that part too?
>
> Also, in the above numbers, how much of the time is spent constructing and destructing CudaANISymmetryFunction objects, and how much is spent in the kernels?

The implementation of the atomic NNs isn't optimal in TorchANI. For example, ANI-2x has 8 sets of atomic NNs, each set has 7 atomic NNs (one per element), and each atomic NN is a 3-layer fully-connected network. The networks are computed sequentially, so a matrix-multiplication kernel is executed 168 times (= 8 × 7 × 3) just in the forward pass. Using batched matrix multiplication, it should be possible to reduce that to 3 kernel executions (see the sketch below).
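To make the idea concrete, here is a minimal sketch of the batched evaluation. This is not TorchANI's actual code: the layer sizes are hypothetical, biases are omitted, and it assumes all the atomic NNs share the same shapes, whereas the real per-element networks have different hidden widths and would need padding or grouping by shape.

```python
import torch

# Hypothetical sizes, for illustration only
n_nets = 8 * 7          # 8 ensemble members x 7 per-element networks
n_atoms = 46            # atoms routed to each network (padded for simplicity)
feat, hidden, out = 1008, 160, 1

# Stack the weights of all 56 networks into one tensor per layer,
# instead of keeping 56 separate Linear modules
w1 = torch.randn(n_nets, feat, hidden, device="cuda")
w2 = torch.randn(n_nets, hidden, hidden, device="cuda")
w3 = torch.randn(n_nets, hidden, out, device="cuda")

x = torch.randn(n_nets, n_atoms, feat, device="cuda")  # batched features

# 3 batched-matmul kernels in the forward pass instead of 168 sequential ones
h = torch.celu(torch.bmm(x, w1))
h = torch.celu(torch.bmm(h, w2))
atomic_energies = torch.bmm(h, w3)
```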

After finishing #5, I'll try to make a batched PyTorch implementation. Ultimately, TensorRT should be very good at that.

raimis (Contributor, Author) commented Oct 14, 2020

@peastman, just letting you know before you start writing the NN part directly in CUDA.

I almost have a working implementation of the NN part using batched matrix multiplications. I still have to fix a bug or two, but I can already see a significant performance gain. I'll share the benchmarks soon.

peastman (Member) commented
Thanks! Looking forward to seeing it.

raimis mentioned this issue Oct 15, 2020
isayev commented Oct 26, 2020

Dear @raimis, these results look amazing! Just a note: the 1x/2x hyperparameter optimization was done only with respect to accuracy. We would very much welcome performance considerations and other constraints for the next iteration. Even the current models could be re-trained and re-fitted if necessary.

raimis (Contributor, Author) commented Oct 28, 2020

@isayev In the case of ANI-2x, for small molecules (~100 atoms) the bottleneck is the matrix multiplications in the dense layers, so a single-model NNP (rather than the 8-member ensemble) would improve speed (see the sketch below). For bigger molecules, the bottleneck becomes the neighbour search for the symmetry functions.
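For reference, a minimal sketch of running a single ensemble member instead of the full ensemble, assuming TorchANI's model-indexing API (to the best of my knowledge, indexing the built-in model returns one member in TorchANI 2.x); the water geometry is just an illustrative input:

```python
import torch
import torchani

device = torch.device("cuda")

# Full 8-member ANI-2x ensemble
ensemble = torchani.models.ANI2x(periodic_table_index=True).to(device)

# A single ensemble member: roughly 8x less matmul work in the NN part,
# at the cost of losing the ensemble averaging
single = ensemble[0]

# Hypothetical input: atomic numbers and coordinates for one molecule (H2O)
species = torch.tensor([[8, 1, 1]], device=device)
coordinates = torch.tensor([[[0.00, 0.00, 0.00],
                             [0.00, 0.00, 0.96],
                             [0.93, 0.00, -0.24]]],
                           device=device, requires_grad=True)

energy = single((species, coordinates)).energies
forces = -torch.autograd.grad(energy.sum(), coordinates)[0]
```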

raimis added the enhancement label on May 24, 2022