
❓ [QUESTION] Train models in single precision, but evaluate them in double precision #92

Closed
svandenhaute opened this issue Oct 25, 2021 · 8 comments
Labels: question (Further information is requested)

Comments

@svandenhaute

Is it possible to train models in single precision, but deploy them in double precision? I'm trying to use my single-precision-trained models to compute some finite differences, and this is typically only possible when the output of the models uses double precision.
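
For concreteness, the kind of finite-difference estimate in question might look like the sketch below (the function name and step size are illustrative), assuming an ASE Atoms object with the trained model attached as its calculator. With float32 energies, relative rounding noise of roughly 1e-7 easily swamps the energy difference for small steps.

def central_diff_force(atoms, i, axis, h=1e-3):
    # estimate -dE/dx for atom i along a Cartesian axis by central differences
    pos0 = atoms.get_positions().copy()
    energies = []
    for sign in (+1.0, -1.0):
        pos = pos0.copy()
        pos[i, axis] += sign * h
        atoms.set_positions(pos)
        energies.append(atoms.get_potential_energy())
    atoms.set_positions(pos0)  # restore the original geometry
    return -(energies[0] - energies[1]) / (2.0 * h)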

@svandenhaute added the question label on Oct 25, 2021
@Linux-cpp-lisp
Collaborator

Depends on what your goal is: if you just want float64 to come out instead of float32, definitely. Whether you will gain any actual improvement from this, I'm not so sure, since all the weights will be noise past single precision. (I guess you could avoid rounding errors in the intermediate states... I feel like that should only matter for pathological inputs/weights, but maybe yours are.)

Are you using a deployed model? In Python or C++? This will affect how to do it.

Side note: a deployed NequIP model is still a full-fledged PyTorch compute graph including arbitrary autograd support. Depending on which finite differences you are trying to compute, using the autograd engine may be a much better approach.
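
For reference, the autograd route can look roughly like the sketch below, assuming a deployed model that takes a dict of tensors with positions under "pos" and returns a dict with a "total_energy" entry; the key names follow NequIP's conventions but are illustrative here.

import torch

def autograd_forces(model, data):
    # differentiate the predicted energy w.r.t. positions; forces are the negative gradient
    data = dict(data)
    data["pos"] = data["pos"].detach().requires_grad_(True)
    out = model(data)
    energy = out["total_energy"].sum()
    (dE_dpos,) = torch.autograd.grad(energy, data["pos"])
    return -dE_dpos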

@svandenhaute
Author

I'm trying to perform geometry optimizations using ASE. Initially it works well, but after a while it gets stuck (the energy no longer changes while fmax ~ 0.01 eV/A) and returns an error: "Gradient and/or function calls were not changing. May indicate that precision was lost, i.e., the routine did not converge." I was using the CG optimizer from scipy, but quasi-Newton methods like BFGS also seem to get stuck. However, even though the optimization did not complete successfully, the Hessian matrix had already turned out to be positive definite, so I'm not sure whether there actually is a problem. I was just curious what would happen if the model were float64. Would something like model.double() work?

PS: You're right that it's noise, but if the precision is high, then the noise is also precisely constant and will get cancelled when taking finite differences. When the precision is lower, then that will no longer be the case. I've experienced a similar issue when dealing with classical force fields that are evaluated on a GPU; numerical estimates of e.g. the stress tensor were in my experience only possible when requiring double precision during the energy evaluation.
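
For context, the setup described above might look roughly like the sketch below; the calculator construction follows nequip.ase.NequIPCalculator, but the exact arguments and file names are illustrative.

from ase.io import read
from ase.optimize.sciopt import SciPyFminCG  # ASE's wrapper around scipy's CG optimizer
from nequip.ase import NequIPCalculator

atoms = read("start.xyz")  # hypothetical starting geometry
atoms.calc = NequIPCalculator.from_deployed_model(model_path="deployed.pth", device="cpu")
opt = SciPyFminCG(atoms)
opt.run(fmax=0.01)  # eV/A, roughly where the float32 model stalls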

@Linux-cpp-lisp
Collaborator

So unfortunately it's a little more involved than that: currently NequIP "freezes" the compiled model as part of deployment, which allows PyTorch to inline all weights, buffers, etc. and do any optimizations that enables. A side effect of freezing is that nothing is registered as a parameter/buffer anymore; everything is an inline constant in the TorchScript graph, which includes its dtype and device. The device can be worked around, since there is a map_location option when loading a TorchScript model, but there is no corresponding option for dtype. This means that .to() (and derived methods like .double()) become no-ops: see pytorch/pytorch#57569.
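
A toy sketch of the behavior described above (the small module stands in for a deployed NequIP model): after freezing, no parameters or buffers remain registered, so .double() has nothing to act on.

import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 1)

    def forward(self, x):
        return self.linear(x)

frozen = torch.jit.freeze(torch.jit.script(Toy()).eval())
print(list(frozen.named_parameters()))  # []: the weights are now inline constants in the graph
frozen = frozen.double()  # silently does nothing, since there is nothing left to cast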

For now, the workaround is to not freeze models at deployment in your install. On main, this can be achieved by simply commenting out this line: https://github.com/mir-group/nequip/blob/main/nequip/scripts/deploy.py#L142. (On develop, the change goes here: https://github.com/mir-group/nequip/blob/develop/nequip/scripts/deploy.py#L52.)

I plan to add a flag so that NequIP can optionally skip freezing the model during deployment, which should give this a more lasting fix, but I'm not sure when that will land.

Also cc'ing @simonbatzner who may be interested in the geometry minimization part.

@svandenhaute
Author

But what if you create a double precision model first? Something like model_double = model.double() before calling torch.jit.freeze()? Wouldn't that work?

@Linux-cpp-lisp
Collaborator

Yes, that modification would also work; it's why I'm considering changing the way NequIP does this to freeze the loaded model right before using it, rather than before saving, to maintain this kind of flexibility... just trying to get some clarity from the PyTorch people first about backward compatibility, etc.
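
In other words, a sketch of that modification, assuming `model` is the scripted model right before the freeze step in the deploy script (the variable names and output path are illustrative):

import torch

model = model.double()  # cast parameters/buffers while they are still registered
model = torch.jit.freeze(model.eval())  # the inlined constants are now float64
torch.jit.save(model, "deployed-float64.pth")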

@svandenhaute
Author

I can confirm that this works! Simply adding model = model.double() before the torch.jit.freeze() call ensured the model was deployed in double precision. During inference, it's important not to forget to set torch.set_default_dtype(torch.float64), because otherwise the data loading inside NequIP's ASE calculator will still generate torch.float32 Tensors.
On a side note, this did indeed solve my optimization issues. The model no longer gets stuck in configurations that are not the minimum.
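
For completeness, the inference-side piece might look like this (the calculator arguments are illustrative); the default dtype has to be set before any data is built:

import torch
from nequip.ase import NequIPCalculator

torch.set_default_dtype(torch.float64)  # otherwise NequIP's data loading builds float32 tensors
calc = NequIPCalculator.from_deployed_model(model_path="deployed-float64.pth", device="cpu")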

@Linux-cpp-lisp
Collaborator

Interesting! Glad to hear this solved your problem.

Since it appears that this may be something that is useful in general for arbitrary potentials, I will try to see if there is a way to incorporate flexibility about dtype for deployment directly into nequip in the future.

@Linux-cpp-lisp
Collaborator

> Since it appears that this may be something that is useful in general for arbitrary potentials, I will try to see if there is a way to incorporate flexibility about dtype for deployment directly into nequip in the future.

FYI, this is now fixed: models are now frozen right before they are used, and the dtype of a deployed model can be cast by doing something like:

import torch

# a deployed model is a TorchScript archive, so load/save it with torch.jit
model = torch.jit.load("deployed.pth")
model = model.double()  # cast all registered weights/buffers to float64
# (note: re-saving this way may not carry over extra metadata stored in the original archive)
torch.jit.save(model, "deployed-float64.pth")
