
❓ [QUESTION] Train models in single precision, but evaluate them in double precision #92

Closed
svandenhaute opened this issue Oct 25, 2021 · 8 comments
Labels: question (Further information is requested)

Comments

@svandenhaute

Is it possible to train models in single precision, but deploy them in double precision? I'm trying to use my single-precision-trained models to compute some finite differences, and this is typically only possible when the output of the models uses double precision.
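
For concreteness, the kind of finite-difference estimate in question might look like the sketch below (the function name and step size are illustrative), assuming an ASE Atoms object with the trained model attached as its calculator. With float32 energies, relative rounding noise of roughly 1e-7 easily swamps the energy difference for small steps.

def central_diff_force(atoms, i, axis, h=1e-3):
    # estimate -dE/dx for atom i along a Cartesian axis by central differences
    pos0 = atoms.get_positions().copy()
    energies = []
    for sign in (+1.0, -1.0):
        pos = pos0.copy()
        pos[i, axis] += sign * h
        atoms.set_positions(pos)
        energies.append(atoms.get_potential_energy())
    atoms.set_positions(pos0)  # restore the original geometry
    return -(energies[0] - energies[1]) / (2.0 * h)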

@svandenhaute added the question label on Oct 25, 2021
@Linux-cpp-lisp
Collaborator

Depends on what your goal is: if you just want float64 to come out instead of float32, definitely. Whether you will gain any actual improvement from this, I'm not so sure, since all the weights will be noise past single precision. (I guess you could avoid rounding errors in the intermediate states... I feel like that should only matter for pathological inputs/weights, but maybe yours are.)

Are you using a deployed model? In Python or C++? This will affect how to do it.

Side note: a deployed NequIP model is still a full-fledged PyTorch compute graph including arbitrary autograd support. Depending on which finite differences you are trying to compute, using the autograd engine may be a much better approach.
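
For reference, the autograd route can look roughly like the sketch below, assuming a deployed model that takes a dict of tensors with positions under "pos" and returns a dict with a "total_energy" entry; the key names follow NequIP's conventions but are illustrative here.

import torch

def autograd_forces(model, data):
    # differentiate the predicted energy w.r.t. positions; forces are the negative gradient
    data = dict(data)
    data["pos"] = data["pos"].detach().requires_grad_(True)
    out = model(data)
    energy = out["total_energy"].sum()
    (dE_dpos,) = torch.autograd.grad(energy, data["pos"])
    return -dE_dpos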

@svandenhaute
Author

I'm trying to perform geometry optimizations using ASE. Initially it works well, but after a while it gets stuck (the energy no longer changes while fmax ~ 0.01 eV/A) and returns an error: "Gradient and/or function calls were not changing. May indicate that precision was lost, i.e., the routine did not converge." I was using the CG optimizer from scipy, but quasi-Newton methods like BFGS also seem to get stuck. However, even though the optimization did not complete successfully, the Hessian matrix had already turned out to be positive definite, so I'm not sure whether there actually is a problem. I was just curious what would happen if the model were float64. Would something like model.double() work?

PS: You're right that it's noise, but if the precision is high, then the noise is also precisely constant and will get cancelled when taking finite differences. When the precision is lower, then that will no longer be the case. I've experienced a similar issue when dealing with classical force fields that are evaluated on a GPU; numerical estimates of e.g. the stress tensor were in my experience only possible when requiring double precision during the energy evaluation.
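
For context, the setup described above might look roughly like the sketch below; the calculator construction follows nequip.ase.NequIPCalculator, but the exact arguments and file names are illustrative.

from ase.io import read
from ase.optimize.sciopt import SciPyFminCG  # ASE's wrapper around scipy's CG optimizer
from nequip.ase import NequIPCalculator

atoms = read("start.xyz")  # hypothetical starting geometry
atoms.calc = NequIPCalculator.from_deployed_model(model_path="deployed.pth", device="cpu")
opt = SciPyFminCG(atoms)
opt.run(fmax=0.01)  # eV/A, roughly where the float32 model stalls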

@Linux-cpp-lisp
Collaborator

So unfortunately it's a little more involved than that: currently NequIP "freezes" the compiled model as part of deployment, which allows PyTorch to inline all weights, buffers, etc. and do any optimizations that enables. A side effect of freezing is that nothing is registered as a parameter/buffer anymore; everything is an inline constant in the TorchScript graph, which includes its dtype and device. The device can be worked around, since there is a map_location option when loading a TorchScript model, but there is no corresponding option for dtype. This means that .to() (and derived methods like .double()) become no-ops: see pytorch/pytorch#57569.
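
A toy sketch of the behavior described above (the small module stands in for a deployed NequIP model): after freezing, no parameters or buffers remain registered, so .double() has nothing to act on.

import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 1)

    def forward(self, x):
        return self.linear(x)

frozen = torch.jit.freeze(torch.jit.script(Toy()).eval())
print(list(frozen.named_parameters()))  # []: the weights are now inline constants in the graph
frozen = frozen.double()  # silently does nothing, since there is nothing left to cast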

For now, the workaround is to not freeze models at deployment in your install. On main, this can be achieved by simply commenting out this line: https://github.com/mir-group/nequip/blob/main/nequip/scripts/deploy.py#L142. (On develop, the change goes here: https://github.com/mir-group/nequip/blob/develop/nequip/scripts/deploy.py#L52.)

I plan to add a flag so that NequIP can optionally skip freezing the model during deployment, which should give this a more lasting fix, but I'm not sure when that will land.

Also cc'ing @simonbatzner who may be interested in the geometry minimization part.

@svandenhaute
Author

But what if you create a double precision model first? Something like model_double = model.double() before calling torch.jit.freeze()? Wouldn't that work?

@Linux-cpp-lisp
Collaborator

Yes, that modification would also work; it's why I'm considering changing the way NequIP does this to freeze the loaded model right before using it, rather than before saving, to maintain this kind of flexibility... just trying to get some clarity from the PyTorch people first about backward compatibility, etc.
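
In other words, a sketch of that modification, assuming `model` is the scripted model right before the freeze step in the deploy script (the variable names and output path are illustrative):

import torch

model = model.double()  # cast parameters/buffers while they are still registered
model = torch.jit.freeze(model.eval())  # the inlined constants are now float64
torch.jit.save(model, "deployed-float64.pth")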

@svandenhaute
Author

I can confirm that this works! Simply adding model = model.double() before the torch.jit.freeze() call ensured the model was deployed in double precision. During inference, it's important not to forget to set torch.set_default_dtype(torch.float64), because otherwise the data loading inside NequIP's ASE calculator will still generate torch.float32 Tensors.
On a side note, this did indeed solve my optimization issues. The model no longer gets stuck in configurations that are not the minimum.
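
For completeness, the inference-side piece might look like this (the calculator arguments are illustrative); the default dtype has to be set before any data is built:

import torch
from nequip.ase import NequIPCalculator

torch.set_default_dtype(torch.float64)  # otherwise NequIP's data loading builds float32 tensors
calc = NequIPCalculator.from_deployed_model(model_path="deployed-float64.pth", device="cpu")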

@Linux-cpp-lisp
Collaborator

Interesting! Glad to hear this solved your problem.

Since it appears that this may be something that is useful in general for arbitrary potentials, I will try to see if there is a way to incorporate flexibility about dtype for deployment directly into nequip in the future.

@Linux-cpp-lisp
Collaborator

> Since it appears that this may be something that is useful in general for arbitrary potentials, I will try to see if there is a way to incorporate flexibility about dtype for deployment directly into nequip in the future.

FYI, this is now fixed: models are now frozen right before they are used, and the dtype of a deployed model can be cast by doing something like:

import torch

# a deployed model is a TorchScript archive, so load/save it with torch.jit
model = torch.jit.load("deployed.pth")
model = model.double()  # cast all registered weights/buffers to float64
# (note: re-saving this way may not carry over extra metadata stored in the original archive)
torch.jit.save(model, "deployed-float64.pth")
