A General cclib-based Taskdoc #64

Merged
merged 17 commits into from
Jan 31, 2022

Andrew-S-Rosen
Member

@Andrew-S-Rosen Andrew-S-Rosen commented Jan 27, 2022

This is a PR for a general cclib-based taskdoc based on #59.

Some nice things about this include:

  • Instant support for output parsing of virtually all popular molecular DFT codes out-of-the-box
  • Offloading the work of updating parsers for new codes and new code versions to cclib, especially for codes without existing infrastructure in pymatgen
  • Consistent TaskDocument attributes across molecular DFT codes make it easy to write interoperable queries.
  • cclib supports a large number of population analysis methods (see here). So, like the bader_caller in Pymatgen for VASP, we now have the option to carry out many different (quick) post-processing analyses to add to the TaskDocument following a DFT run.

@utf: Let me know if you have any initial suggestions at this stage.

Here is what an example taskdoc for a Gaussian calculation on O2 looks like:

from atomate2.common.schemas import cclib
# Read the log file
doc = cclib.TaskDocument.from_logfile(".", ".log")
print(doc.dict())
{'nsites': 2,
 'elements': [Element O],
 'nelements': 1,
 'composition': Comp: O2,
 'composition_reduced': Comp: O2,
 'formula_pretty': 'O2',
 'formula_anonymous': 'A',
 'chemsys': 'O',
 'point_group': 'D*h',
 'charge': 0,
 'spin_multiplicity': 3,
 'nelectrons': 16,
 'molecule': Molecule Summary
 Site: O (0.3974, 0.0000, 0.0000)
 Site: O (1.6026, 0.0000, 0.0000),
 'dir_name': 'LAPTOP-ROFVQCUO.attlocal.net:C:\\Users\\asros\\Desktop',
 'logfile': 'LAPTOP-ROFVQCUO.attlocal.net:C:\\Users\\asros\\Desktop\\gau_testopt.log.gz',
 'attributes': {'atomcharges': {'mulliken': array([ 1.e-06, -1.e-06]),
   'mulliken_sum': array([ 1.e-06, -1.e-06]),
   'apt': array([ 1.e-06, -1.e-06]),
   'apt_sum': array([ 1.e-06, -1.e-06])},
  'atommasses': array([15.9949146, 15.9949146, 15.9949146, 15.9949146]),
  'atomspins': {'mulliken': array([1.000001, 0.999999]),
   'mulliken_sum': array([1.000001, 0.999999])},
  'coreelectrons': array([0, 0], dtype=int32),
  'enthalpy': -150.36247,
  'entropy': 7.805131645143065e-05,
  'freeenergy': -150.385741,
  'geotargets': array([0.00045, 0.0003 , 0.0018 , 0.0012 ]),
  'geovalues': array([[1.19958e-01, 1.19958e-01, 1.50000e-01, 2.12132e-01],
         [1.42971e-01, 1.42971e-01, 2.52269e-01, 3.56762e-01],
         [1.74029e-01, 1.74029e-01, 4.24264e-01, 6.00000e-01],
         [1.65732e-01, 1.65732e-01, 7.82070e-02, 1.10601e-01],
         [4.22300e-03, 4.22300e-03, 2.61700e-03, 3.70200e-03],
         [4.10000e-05, 4.10000e-05, 2.50000e-05, 3.50000e-05],
         [4.00000e-05, 4.00000e-05, 2.50000e-05, 3.50000e-05]]),
  'grads': array([[[ 1.19957536e-01,  0.00000000e+00, -0.00000000e+00],
          [-1.19957536e-01, -0.00000000e+00,  0.00000000e+00]],

         [[ 1.42971227e-01,  0.00000000e+00,  0.00000000e+00],
          [-1.42971227e-01, -0.00000000e+00, -0.00000000e+00]],

         [[ 1.74028967e-01, -0.00000000e+00, -0.00000000e+00],
          [-1.74028967e-01,  0.00000000e+00,  0.00000000e+00]],

         [[-1.65731529e-01,  0.00000000e+00,  0.00000000e+00],
          [ 1.65731529e-01, -0.00000000e+00, -0.00000000e+00]],

         [[ 4.22343100e-03,  0.00000000e+00, -0.00000000e+00],
          [-4.22343100e-03, -0.00000000e+00,  0.00000000e+00]],

         [[-4.05590000e-05, -0.00000000e+00,  0.00000000e+00],
          [ 4.05590000e-05,  0.00000000e+00, -0.00000000e+00]],

         [[-4.04980000e-05,  0.00000000e+00, -0.00000000e+00],
          [ 4.04980000e-05, -0.00000000e+00,  0.00000000e+00]]]),
  'homos': array([8, 6], dtype=int32),
  'moenergies': [array([-520.20249703, -520.20059223,  -34.63138553,  -22.00421441,
           -14.00706045,  -14.00706045,  -13.97712793,   -7.05400735,
            -7.05400735,    4.23844534,    9.3065658 ,   10.94197004,
            10.94197004,   11.6581737 ,   12.55859843,   13.72569473,
            13.72569473,   19.65287862,   32.54209538,   32.5423675 ,
            34.85778425,   34.85778425,   43.46175209,   43.46175209,
            50.14786151,   57.14445283,   57.14445283,   57.56677353,
            65.76529173,   65.76529173,   69.32780626,   75.19811836,
            75.19811836,   75.76493151,   78.92226852,  117.07018133,
           126.27443232,  126.27443232,  127.07145379,  127.07145379,
           135.52956861,  135.52956861,  136.69421589,  142.3427552 ,
           142.3427552 ,  157.07608752,  157.07663175,  165.24249629,
           165.24331263,  171.90955774,  171.91119042,  172.34684469,
           172.34684469,  174.80593756,  175.5651352 ,  175.5651352 ,
           200.0230002 ,  200.0230002 ,  206.00487898,  239.22100037,
          1174.27412394, 1188.86105901]),
   array([-519.37363824, -519.37200555,  -32.85584265,  -19.56525796,
           -12.64703543,  -11.61844507,  -11.61844507,   -3.94238547,
            -3.94238547,    5.39520131,    9.33867524,   11.40184245,
            11.40184245,   12.16430546,   12.28430767,   14.3458422 ,
            14.3458422 ,   20.25887617,   34.21341865,   34.21341865,
            35.67467003,   35.67467003,   45.73880079,   45.73880079,
            51.44747726,   58.36733248,   58.71971991,   58.71971991,
            68.59309886,   68.59309886,   71.2932846 ,   77.83599002,
            77.83599002,   77.91327036,   79.86051707,  118.57388247,
           128.06548569,  128.06548569,  128.46413248,  128.46413248,
           138.85452775,  138.85452775,  139.27058982,  145.83234322,
           145.83234322,  159.88702359,  159.88702359,  169.01834807,
           169.01862019,  174.79069919,  174.79069919,  175.39588039,
           175.3961525 ,  176.48923384,  178.63049773,  178.63049773,
           202.34467557,  202.34467557,  207.15020617,  241.02157771,
          1174.30650549, 1188.86949454])],
  'moments': [array([0., 0., 0.]),
   array([-0., -0., -0.]),
   array([-10.3365,  -0.    ,  -0.    ,  -9.838 ,  -0.    ,  -9.838 ]),
   array([-31.0095,  -0.    ,  -0.    ,  -9.838 ,   0.    ,  -9.838 ,
           -0.    ,  -0.    ,  -0.    ,  -0.    ]),
   array([-90.0359,   0.    ,  -0.    , -15.9012,   0.    , -15.9012,
           -0.    ,  -0.    ,  -6.8379,  -0.    ,  -2.2793,  -0.    ,
           -0.    ,  -0.    ,  -6.8379])],
  'nbasis': 62,
  'nmo': 62,
  'optdone': True,
  'optstatus': array([1, 0, 0, 0, 0, 4, 5], dtype=int32),
  'polarizabilities': [array([[14.054,  0.   , -0.   ],
          [ 0.   ,  5.398,  0.   ],
          [-0.   ,  0.   ,  5.398]]),
   array([[ 1.40544283e+01,  3.35092331e-12, -4.02917938e-12],
          [ 3.35092331e-12,  5.39771016e+00,  7.14623265e-13],
          [-4.02917938e-12,  7.14623265e-13,  5.39771016e+00]])],
  'pressure': 1.0,
  'scfenergies': array([-4085.96814201, -4087.04106668, -4089.24542248, -4091.45285088,
         -4091.76297765, -4091.76327656, -4091.76327656]),
  'scftargets': array([[1.e-08, 1.e-06, 1.e-06],
         [1.e-08, 1.e-06, 1.e-06],
         [1.e-08, 1.e-06, 1.e-06],
         [1.e-08, 1.e-06, 1.e-06],
         [1.e-08, 1.e-06, 1.e-06],
         [1.e-08, 1.e-06, 1.e-06],
         [1.e-08, 1.e-06, 1.e-06]]),
  'scfvalues': [array([[ 2.47e-03,  3.67e-02,       nan],
          [ 1.35e-03,  2.07e-02, -5.01e-03],
          [ 1.91e-04,  2.83e-03, -1.25e-02],
          [ 4.12e-05,  7.97e-04, -1.82e-04],
          [ 4.96e-06,  9.66e-05, -6.06e-06],
          [ 4.96e-06,  9.66e-05, -4.12e-05],
          [ 1.82e-06,  4.63e-05, -2.57e-07],
          [ 2.57e-06,  5.40e-05, -3.09e-09],
          [ 2.48e-06,  4.87e-05,  1.10e-07],
          [ 1.65e-07,  3.49e-06, -1.13e-07],
          [ 1.64e-08,  1.54e-07, -5.13e-10],
          [ 2.06e-09,  3.87e-08, -3.24e-12]]),
   array([[ 5.84e-04,  7.97e-03,       nan],
          [ 1.25e-04,  2.00e-03, -1.20e-03],
          [ 4.70e-05,  7.31e-04, -3.72e-05],
          [ 3.68e-06,  4.14e-05, -2.04e-05],
          [ 3.68e-06,  4.14e-05,  1.39e-05],
          [ 3.68e-06,  7.48e-05, -2.16e-07],
          [ 5.39e-06,  1.06e-04,  1.80e-07],
          [ 2.62e-06,  5.35e-05, -7.00e-08],
          [ 2.01e-07,  3.60e-06, -1.22e-07],
          [ 2.46e-08,  3.05e-07, -6.96e-10],
          [ 3.17e-09,  3.95e-08, -6.20e-12]]),
   array([[ 2.70e-03,  3.39e-02,       nan],
          [ 1.42e-03,  1.92e-02,  9.46e-03],
          [ 3.37e-04,  4.96e-03, -1.80e-02],
          [ 1.04e-04,  1.75e-03, -4.11e-04],
          [ 8.73e-06,  1.42e-04, -9.25e-05],
          [ 8.73e-06,  1.42e-04, -6.73e-05],
          [ 1.43e-06,  1.74e-05, -6.96e-08],
          [ 2.61e-06,  5.02e-05, -3.63e-09],
          [ 2.02e-06,  3.89e-05,  7.26e-08],
          [ 7.39e-08,  1.07e-06, -8.09e-08],
          [ 1.94e-08,  3.19e-07, -2.99e-11],
          [ 1.98e-09,  2.64e-08, -2.84e-12]]),
   array([[ 5.57e-03,  5.74e-02,       nan],
          [ 3.17e-03,  4.11e-02, -9.66e-02],
          [ 6.40e-04,  1.02e-02, -1.39e-02],
          [ 2.51e-04,  3.49e-03, -1.38e-03],
          [ 4.51e-05,  7.88e-04, -2.35e-04],
          [ 1.10e-05,  1.84e-04, -1.29e-05],
          [ 7.23e-07,  1.33e-05, -2.53e-07],
          [ 7.23e-07,  1.33e-05,  1.85e-05],
          [ 1.10e-06,  1.70e-05, -3.51e-08],
          [ 3.49e-07,  5.10e-06, -4.98e-10],
          [ 3.93e-08,  6.59e-07, -6.08e-10],
          [ 4.88e-09,  1.12e-07, -3.27e-12]]),
   array([[ 2.67e-03,  3.91e-02,       nan],
          [ 1.08e-03,  1.05e-02,  1.04e-02],
          [ 1.84e-04,  2.51e-03, -1.68e-02],
          [ 6.14e-05,  1.27e-03, -8.18e-05],
          [ 1.85e-05,  2.69e-04, -1.50e-05],
          [ 3.47e-06,  5.99e-05, -1.11e-06],
          [ 3.47e-06,  5.99e-05,  1.40e-05],
          [ 9.94e-07,  1.37e-05, -3.10e-08],
          [ 6.48e-07,  1.01e-05, -2.67e-09],
          [ 3.36e-07,  5.90e-06,  6.08e-10],
          [ 3.45e-08,  5.38e-07, -1.74e-09],
          [ 6.62e-09,  1.23e-07, -6.71e-12]]),
   array([[ 3.27e-05,  3.30e-04,       nan],
          [ 9.96e-06,  1.44e-04, -6.22e-06],
          [ 3.83e-06,  4.72e-05, -1.55e-07],
          [ 2.91e-07,  3.97e-06, -1.13e-07],
          [ 5.48e-08,  9.80e-07, -2.56e-10],
          [ 7.03e-09,  9.62e-08, -6.25e-12]]),
   array([[7.75e-09, 1.38e-07,      nan]])],
  'temperature': 298.15,
  'vibdisps': array([[[ 0.71,  0.  ,  0.  ],
          [-0.71,  0.  ,  0.  ]]]),
  'vibfconsts': array([25.622]),
  'vibfreqs': array([1648.8846]),
  'vibirs': array([0.]),
  'vibrmasses': array([15.9949]),
  'vibsyms': ['A'],
  'zpve': 0.003756,
  'molecule_initial': Molecule Summary
  Site: O (0.0000, 0.0000, 0.0000)
  Site: O (2.0000, 0.0000, 0.0000),
  'molecule_final': Molecule Summary
  Site: O (0.3974, 0.0000, 0.0000)
  Site: O (1.6026, 0.0000, 0.0000),
  'homo_energies': [-7.054007346511501, -11.618445074798501],
  'lumo_energies': [4.2384453353880005, -3.9423854660440005],
  'homo_lumo_gaps': [11.292452681899501, 7.6760596087545006],
  'min_homo_lumo_gap': 7.6760596087545006},
 'metadata': {'package': 'Gaussian',
  'methods': ['DFT', 'DFT', 'DFT', 'DFT', 'DFT', 'DFT', 'DFT'],
  'success': True,
  'legacy_package_version': '16revisionA.03',
  'package_version': '2016+A.03',
  'platform': 'ES64L',
  'basis_set': 'def2TZVP',
  'functional': 'M06L',
  'cpu_time': ['0:00:00', '0:00:21.200000'],
  'wall_time': ['0:00:00', '0:00:05.600000']},
 'task_label': None,
 'tags': None,
 'last_updated': '2022-01-28 07:44:54.179872'}
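
The derived homo_energies, lumo_energies, homo_lumo_gaps, and min_homo_lumo_gap entries in the output above are consistent with indexing moenergies by homos, one entry per spin channel. The following is a minimal sketch of that relationship, not the code in this PR; it assumes the LUMO is simply the orbital after the HOMO, and it reproduces only the first few orbital energies from the example.

```python
# Values copied from the example output above; only the leading orbital
# energies of each spin channel are reproduced here.
moenergies = [
    [-520.20249703, -520.20059223, -34.63138553, -22.00421441, -14.00706045,
     -14.00706045, -13.97712793, -7.05400735, -7.05400735, 4.23844534],
    [-519.37363824, -519.37200555, -32.85584265, -19.56525796, -12.64703543,
     -11.61844507, -11.61844507, -3.94238547],
]
homos = [8, 6]  # zero-based index of the HOMO in each spin channel

# Assumption: the LUMO is the orbital immediately above the HOMO.
homo_energies = [energies[i] for energies, i in zip(moenergies, homos)]
lumo_energies = [energies[i + 1] for energies, i in zip(moenergies, homos)]
homo_lumo_gaps = [lumo - homo for homo, lumo in zip(homo_energies, lumo_energies)]
min_homo_lumo_gap = min(homo_lumo_gaps)
```

With these inputs the two gaps come out at roughly 11.29 and 7.68, matching the values stored in the document.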

If someone wanted to do post-processing before populating the TaskDocument, that could be done like:

from atomate2.common.schemas import cclib
# Read the log file and do a Mayer bond order + Bader analysis
doc = cclib.TaskDocument.from_logfile(".", ".log", analysis=["MBO", "Bader"])

If someone wanted to include the Molecule objects for the full optimization trajectory in their TaskDocument, that could be done like:

from atomate2.common.schemas import cclib
# Read the log file and store the Molecule objects for the full optimization trajectory
doc = cclib.TaskDocument.from_logfile(".", ".log", store_trajectory=True)

This scheme should support the following codes, which are all supported by cclib: ADF, DALTON, Firefly, GAMESS, Gaussian, Jaguar, Molcas, Molpro, MOPAC, NWChem, ORCA, Psi4, Q-Chem, Turbomole.

@codecov-commenter

codecov-commenter commented Jan 27, 2022

Codecov Report

Merging #64 (7d799a8) into main (fe5d8a6) will increase coverage by 0.21%.
The diff coverage is 74.54%.

@@            Coverage Diff             @@
##             main      #64      +/-   ##
==========================================
+ Coverage   70.39%   70.61%   +0.21%     
==========================================
  Files          45       46       +1     
  Lines        3804     3972     +168     
  Branches      576      624      +48     
==========================================
+ Hits         2678     2805     +127     
- Misses        979     1005      +26     
- Partials      147      162      +15     

Impacted Files                             Coverage Δ
src/atomate2/vasp/schemas/calculation.py   83.73% <ø> (ø)
src/atomate2/vasp/schemas/task.py          90.76% <50.00%> (ø)
src/atomate2/common/schemas/cclib.py       72.48% <72.48%> (ø)
src/atomate2/utils/path.py                 93.54% <100.00%> (-6.46%) ⬇️
src/atomate2/vasp/flows/core.py            93.91% <0.00%> (+0.27%) ⬆️
src/atomate2/common/schemas/molecule.py    100.00% <0.00%> (+5.55%) ⬆️

@Andrew-S-Rosen
Member Author

Andrew-S-Rosen commented Jan 27, 2022

@utf, I still have to finish the tests, but two questions for you:

  1. Do I need to convert the cclib-generated numpy arrays to lists in the TaskDocument?
  2. Could you help me out with the mypy linting errors?

The TaskDocument is pretty much done being written now. :)

@Andrew-S-Rosen
Member Author

Andrew-S-Rosen commented Jan 28, 2022

Alright! We should be good to go with this, pending any suggestions you may have!

With one small exception. There is a (currently commented) test I added to the VASP TaskDocument and the new cclib TaskDocument, neither of which I can get to pass. It's a test of the additional_fields dictionary. This might be a bug, including with the VASP TaskDocument. Let me know what you think -- the test is here for the VASP TaskDocument and here for the cclib TaskDocument.

@Andrew-S-Rosen Andrew-S-Rosen changed the title from "[WIP] A General cclib-based Taskdoc" to "A General cclib-based Taskdoc" on Jan 28, 2022
@utf
Member

utf commented Jan 28, 2022

Hi @arosen93

Thanks for this. I think this is a great addition.

To answer your questions:

  1. You don't need to convert numpy arrays to lists. That will happen automatically by jsanitize. However, it means you shouldn't rely on having numpy arrays when you get outputs from the database - you always need to convert to numpy arrays first.
  2. For the failing test, maybe this line doc.copy(update=additional_fields) should actually be doc = doc.copy(update=additional_fields)?
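
The round trip described in item 1 can be sketched as follows. This is an illustration under stated assumptions: `tolist()` plus a JSON round trip stands in for the sanitization and database storage steps (jsanitize itself is not called), and the two attribute values are taken from the example output earlier in the thread.

```python
import json

import numpy as np

# Stand-in for sanitization and storage: arrays become plain nested lists
# so that the document can be serialized.
attributes = {"vibfreqs": np.array([1648.8846]), "homos": np.array([8, 6])}
stored = json.loads(json.dumps({k: v.tolist() for k, v in attributes.items()}))

# What comes back is a plain list, not a numpy array.
came_back_as_list = isinstance(stored["homos"], list)

# Convert back to an array before doing elementwise arithmetic;
# `stored["homos"] + 1` on the plain list would raise a TypeError.
homos = np.asarray(stored["homos"])
n_occupied = homos + 1
```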

As for the subdocuments (i.e., an Attributes document and a Metadata document), I think they could be useful to document the schema more, but I guess they could be added in later.

I just have a couple of questions for you:

  1. How heavy is the cclib dependency? Does it have any components that are written in cython or could cause compilation issues? What do you think about adding it as an optional dependency for now, until there is support for at least one comp chem code?
  2. How much structure is there to the attributes and metadata keys? If they have a well defined schema then it will definitely be worth adding new documents for these subfields.

@Andrew-S-Rosen
Member Author

Andrew-S-Rosen commented Jan 28, 2022

Thanks for the useful feedback!

My questions:

  1. Sounds good. I think that's preferable anyway.
  2. I tried that as well. I still get KeyError: 'additional_fields' in both the VASP TaskDocument and cclib.

Yours:

  1. When you pip install cclib, it's very dependency-light. It requires: numpy, periodictable, scipy>=1.2.0, and packaging>=19.0. The periodictable code only requires pyparsing and numpy again. There are additional packages in cclib's requirements.txt, but those won't install with pip and are unnecessary here. I don't know about cython specifically, but I doubt it. Regardless, I am fine with including cclib as an optional dependency.
  • One of my concerns, slightly independent of this, is that nobody will know about the cclib TaskDocument. It might not be desired for Q-Chem (@samblau, @espottesmith) where there is already extensive pymatgen infrastructure for parsing (and an existing schema in Atomate1), but for other codes it'd be useful. Hopefully people don't end up duplicating effort!
  2. There is significant structure to attributes and metadata. Some codes might support some attributes and not others, but there is consistency in their meaning and structure. The full list of descriptions lives here with even more detail here for attributes and here for metadata. I was a little hesitant about adding documents for each one because, a. there are a ton, and b. cclib gets updated regularly. While it's unlikely an attribute will be removed, new attributes get added all the time, and trying to keep up with the documentation seemed like it'd be burdensome. What are your thoughts? Right now there is just a code comment linking to the data description.

Before it was in a list of total energies (one for each opt step), which makes it slightly harder to query based on
@utf
Member

utf commented Jan 29, 2022

Ok, before merging:

  1. Please could you add cclib as an optional dependency for now? Happy to change this at a later date.
  2. Can you also add a test that the task documents can be jsanitized (see the VASP task document tests for an example)? When running VASP calcs using fireworks, I found that calculations sometimes failed because of a serialisation error.

Going forward:

  1. I agree about highlighting the existence of the cclib task document. I guess we could have a page in the docs listing which codes we support and a summary of the support available, kind of like the summary page in the cclib documentation or the summary of supported codes in the sumo docs (https://github.com/SMTG-UCL/sumo#feature-support-for-different-codes). I'm happy to add this in a separate PR though?
  2. I will look into the additional fields error.

@utf
Member

utf commented Jan 29, 2022

Ah, I figured out the additional_fields. Firstly, the code definitely should be doc = doc.copy(update=additional_fields). Secondly, the additional fields are added to the root level of the document, not to a subkey called "additional_fields". E.g., the test should be:

    # Make sure additional fields can be stored
    doc = TaskDocument.from_logfile(p, ".log", additional_fields={"test": "hi"})
    assert doc["test"] == "hi"
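
Both points can be seen in a small self-contained sketch. The `Doc` class below is a hypothetical stand-in, not the atomate2 TaskDocument; it only mimics a `copy(update=...)` method that returns a new object and merges the extra keys at the root level.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a task document; not the atomate2 class."""

    data: dict = field(default_factory=dict)

    def copy(self, update=None):
        # Return a NEW object with the extra keys merged at the root level.
        return Doc({**self.data, **(update or {})})

    def __getitem__(self, key):
        return self.data[key]


additional_fields = {"test": "hi"}

doc = Doc({"nsites": 2})
doc.copy(update=additional_fields)  # result discarded, so doc is unchanged
unchanged = "test" not in doc.data

doc = doc.copy(update=additional_fields)  # rebind the name to the updated copy
```

After the second call, `doc["test"]` is `"hi"`, and there is no `"additional_fields"` key anywhere in the document.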

@Andrew-S-Rosen
Member Author

That all sounds great to me! I'll make the suggested changes later today. I agree it's a good idea to test the jsanitization. That's gotten me more often than I'd like. And I think providing some details in the docs could definitely help, whether that be in the developer section or somewhere else. If we can put that in a separate PR, that'd be great.

Would you like me to update the additional_fields here? I can do that for the VASP TaskDocument as well as cclib. Thanks for debugging!

@utf
Member

utf commented Jan 29, 2022

Would you like me to update the additional_fields here? I can do that for the VASP TaskDocument as well as cclib. Thanks for debugging!

That would be perfect, thanks!

@Andrew-S-Rosen
Member Author

Andrew-S-Rosen commented Jan 30, 2022

Regarding making cclib an optional dependency, how would you like to do this? Currently:

  • I added the following line to extras_require in setup.py: "cclib": ["cclib>=1.7.1"]
  • I made a new requirements-optional.txt listing cclib>=1.7.1.
  • I added a try/except statement to let the user know they need to install cclib if they try to instantiate a cclib TaskDocument
  • In the .github testing workflow, I now pip install -r requirements-optional.txt
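
The try/except guard in the third bullet could look roughly like the sketch below. The helper name `load_optional` is hypothetical, and the `pip install atomate2[cclib]` hint assumes the extra is named as in the extras_require entry above; the code in the PR may be structured differently.

```python
import importlib


def load_optional(module_name, extra):
    """Import an optional dependency, or raise ImportError with an install hint.

    Both this helper and the install hint are illustrative assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install atomate2[{extra}]"
        ) from exc
```

Calling `load_optional("cclib", "cclib")` inside the method that parses the log file, rather than at module import time, keeps the rest of the package importable without cclib.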

If you'd like anything changed with this setup, let me know! I also added the requested jsanitization check.

@utf
Member

utf commented Jan 31, 2022

Thanks. This looks perfect.

@utf utf merged commit a8ec8be into materialsproject:main Jan 31, 2022
@utf
Member

utf commented Jan 31, 2022

Hi @arosen93, I refactored this a little bit in: 0a6d616

The main changes were:

  • Pin to a specific cclib version in requirements-optional.txt so the testing environment is deterministic.
  • Don't pin to a specific cclib version in setup.py so that users can have flexibility in the package requirements.
  • Use monty.dev.requires for handling whether the cclib task document can be made. This is a nice design pattern so that you don't have to manually try/except. Note this only needs to be applied to the from_logfile method, as technically there is no issue creating a cclib document from a dict of the data.
  • Use pytest.mark.skipif rather than the unittest equivalent as unittest is deprecated.
  • Import anything needed for the tests inside the tests themselves to speed pytest up a bit at the beginning.
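
The decorator pattern in the third bullet can be illustrated with a stand-in written from scratch. The `requires` function below is a hypothetical re-implementation for illustration, not monty's code; in particular, raising RuntimeError is an assumption, so check monty.dev.requires for its actual behaviour.

```python
import functools
import importlib.util


def requires(condition, message):
    """Hypothetical stand-in: fail with `message` when `condition` is false."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The check runs at call time, so importing the module never fails.
            if not condition:
                raise RuntimeError(message)
            return func(*args, **kwargs)

        return wrapper

    return decorator


has_cclib = importlib.util.find_spec("cclib") is not None


@requires(has_cclib, "cclib must be installed to parse a log file.")
def from_logfile(dir_name, logfile_extensions):
    raise NotImplementedError("parsing is omitted in this sketch")


@requires(False, "missing dependency")
def needs_missing():
    return "never reached"


@requires(True, "unused")
def always_ok():
    return "ok"
```

Only the entry point that actually needs the optional package is decorated, which matches the note that building a document from a dict needs no guard.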

@Andrew-S-Rosen
Member Author

Thanks, that's a lot cleaner now!
