Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃専 [FEATURE] Multi-system dataset support & atomic charges/dipoles prediction #89

Closed
leoil opened this issue Oct 11, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@leoil
Copy link

leoil commented Oct 11, 2021

Is your feature request related to a problem? Please describe.
The trained model shows great performance on dataset with single system like MD17. I'm wondering if it also support the training&fitting on dataset with multiple systems.

I removed npz_fixed_field_keys in the config file and run a quick test on the sn2-reaction dataset used in PhysNet, which includes various structures of different molecules related to that reactions. It uses 0 padding to represent those molecular with fewer atoms. Here is the result.

Training
# Epoch batch         loss       loss_f       loss_e     0_f_rmse     1_f_rmse     2_f_rmse     3_f_rmse     4_f_rmse     5_f_rmse     6_f_rmse   all_f_rmse      0_f_mae      1_f_mae      2_f_mae      3_f_mae      4_f_mae      5_f_mae      6_f_mae    all_f_mae        e_mae
    1     1          nan          nan          nan            0            0            0         1.42            0            0            0        0.203            0            0            0        0.711            0            0            0        0.102          nan
    1     2          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     3          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     4          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     5          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     6          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     7          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     8          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan

Also in order to model long range interactions, properties like atomic charges and dipoles is needed. I wonder if it's possible to implement this feature based on the existing code to calculate them.
I've tried to implement it myself, but not clear exactly which part of the code should be modified. :(

Describe the solution you'd like

  • Multi-system dataset support
  • Customized loss function, including energy, atomic forces/charges and dipole.

Additional context
atomic_number in sn2 dataset

>>> print(x['Z'][:10])
[[ 6  1  1  1  9 53]
 [53 53  0  0  0  0]
 [ 6  1  1  1 35  9]
 [ 6  1  1  1  9 53]
 [ 9  9  0  0  0  0]
 [ 6  1  1  1 17 53]
 [ 6  1  1  1 17 17]
 [ 6  1  1  1 35 17]
 [ 6  1  1  1 17 17]
 [ 6  1  1  1  9 53]]
@leoil leoil added the enhancement New feature or request label Oct 11, 2021
@simonbatzner
Copy link
Collaborator

Hi @leoil, thanks for the questions, answers to your two questions below.

  1. Yes the code does support that, I've trained on some of the PhysNet datasets myself and have gotten good results (I used the protein data set, which is also diverse across both chemical and structure space). I also trained on some of the SN2 reactions and saw very good results. Here's how you do it:
  • use the ase dataset input, it looks like this (see full.yaml)
# dataset: ase
dataset_file_name: xxx.xyz                                             
ase_args:                                                                      
  format: extxyz
  • the dataset_filename here is just one long extxyz file that contains all the structures one after another. In the PhysNet data these structures will have different length. If you've never worked with the extxyz file format before, you can construct that dataset_file_name from the PhysNet data like I do in this snippet (this is just pseudocode to get you started, you will have to adapt it to what their actual data and shapes are for each system in the PhysNet data to get around their 0-padding):
import numpy as np
 
from ase import Atoms
from ase.io import write
from ase.calculators.singlepoint import SinglePointCalculator
 
if __name__ == "__main__":
    # load physnet data set 
    data = np.load(...)

    # i will assume you have the following np.ndarrays
    positions = ...
    symbols = ...  
    forces = ...
    energies = ...

    for idx in range(len(data)):
        # their data set uses 0-padding so get the real number of atoms first (assume they are in the first N entries)
        n_real_atoms = data['N'][idx]
 
        # assumes you are working with non-periodic data, if you have periodic systems, add a cell argument and pbc=True
        atoms = Atoms(
            positions=positions[idx][:n_real_atoms],
            symbols=symbols[idx][:n_real_atoms],
            pbc=False
        )
 
        calculator = SinglePointCalculator(atoms, energy=energies[idx], forces=forces[idx][:n_real_atoms])
        atoms.calc = calculator
 
        write('./example-dataset.xyz' atoms, format='extxyz', append=True)
 

You can then simply put that extxyz file as the input to our code and everything should work. Note that if you want energies and forces, you will have to be careful on how to set the loss-function weights (we recommend using a weight of 1 on the energies and a weight of N^2 on the forces, where N is average number of atoms in the system). Let us know if you have any other questions.

  1. Yes, that is possible, we have some efforts working on this and it should be fairly straightforward to implement. In particular, if you just want to predict atomic charges or want to predict the dipole moment as they do in the PhysNet paper and then add it to the loss function, then that should be really easy since the atomic charges are just one additional scalar you're predicting. We're happy to support you on that, if you want to add it. @Linux-cpp-lisp can point you to the right part in the code on where to do this.

@leoil
Copy link
Author

leoil commented Oct 13, 2021

Thanks for your response.

  1. I modified your snippet to construct the sn2 extxyz dataset. However, when I run a fresh start with this dataset, an error occurs when building neibor_list for this frame.

    File "/home/miniconda3/envs/nequip/lib/python3.7/site-packages/nequip/data/AtomicData.py", line 513, in neighbor_list_and_relative_vec
        "After eliminating self edges, no edges remain in this system."
    ValueError: After eliminating self edges, no edges remain in this system.
    
    2
    Properties=species:S:1:pos:R:3:forces:R:3 energy=0.7943275145502244 pbc="F F F"
    I        6.70077400       5.99240300      -1.02271700       0.03244218       0.02902981      -0.00495554
    I       -6.70077400      -5.99240300       1.02271700      -0.03244270      -0.02902879       0.00495606
    

    I tried to set a larger r_max ( r_max = 20.0) , and everything works fine (the default cutoff in PhysNet is 10.0). However, when I set self_interaction = True , the results return 0 and nan again.

    Anyway, there are no such isolated atoms in my own system. I'm still doing some testes on a larger scale, and everything seems working well so far.

  2. In PhysNet, partial charges for each atom are predicted and dipoles are calculated accordingly. Minimizing dipole loss guarantee the accuracy of the partial charge prediction. And partial charge is not directly involved in the calculation of total loss.

    Is it feasible to do something similar (predicting partial charge within nn) in nequip , as it may involve further modifications besides the loss function part. Or, the partial charge itself should be included in the training set as part of the features(as mentioned in the docstring below). Do you have any ideas on how this would work?

    class AtomicData(Data)

    In general, node_features are features or input information on the nodes that will be fed through and transformed by the network, while node_attrs are encodings fixed, inherent attributes of the atoms themselves that remain constant through the network.

    For example, a one-hot encoding of atomic species is a node attribute, while some observed instantaneous property of that atom (current partial charge, for example), would be a feature.

@simonbatzner
Copy link
Collaborator

Hi @leoil, that sounds good! Regarding your questions:

  1. Looking at this system, it looks like the two atoms are more than 10A apart, so it should indeed throw that error. Note that in NequIP the central atom is not included in the convolution, so we don't use self_interaction=True, we are only interested in real neighbors. If you do set self_interaction=True, you will get a 0-distance, which will give uncontrolled behavior, which is likely the reason why you're seeing 0s (and from the gradient then nans) in your output. Keep it at self_interaction=False and make sure the cutoff is large enough so that all atoms see each other.

  2. Yes, that setup is possible and it works, I've used exactly this idea in the past. Note that there are many ways of getting electrostatics, but if you want to go the route of PhysNet, that is definitely possible. You would do exactly what you describe:

  • do not include the partial charge in the loss
  • predict the partial charge, you would get it from a node_feature: you can either do something like just grabbing the first element of the final feature vector or (what I would recommend instead) adding a second head to the network so that you have one output block that goes from the features after the final convolution to the energy and another output block that goes from the features after the final convolution to the partial charge.
  • compute the dipole moment from those partial charges
  • add the total dipole moment to the loss function

The networks should then implicitly learn the correct partial charges. Does that make sense? I highly recommend you do this on our develop branch, it will be a lot easier there and we are looking to merge this into main in the coming weeks. Let us know if you have specific questions on the implementation, we're happy to give you pointers.

@leoil
Copy link
Author

leoil commented Oct 15, 2021

Yes, that make sense. Thanks for your detailed guide, I'll give it a try.

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants