BigChem implementation #199

hjnpark · 2024-04-10T17:28:25Z

geomeTRIC will check to see if BigChem and a given engine are working properly when --bigchem yes argument is provided. If BigChem is ready, the Hessian and NEB calculation will be carried in parallel via BigChem. Currently TeraChem, Psi4, and QChem are available.

Here are the changes:

--port changed to --wqport.
method and basis attributes were added to the engines.
number_output function was added to Psi4 engine.
calc_new function in QCEngineAPI was modified to be compatible with TeraChem (this change is for QCEngine).
BigChem related functions were added in nifty.py.
Minor changes were made to docstrings and contributors' information in the scripts.
Tests were added.

leeping · 2024-04-11T06:37:40Z

I reviewed the code and it looks great; @coltonbh could you please take a quick look?

coltonbh · 2024-04-11T23:31:58Z

Hi @leeping and @hjnpark. Got tied up today. Will review first thing tomorrow!

coltonbh

High level changes I would suggest:

Do not use the global switches turning BigChem on/off. This will lead to great unhappiness! You've hidden when the system will/won't use BigChem and why inside of checking functions (CheckBigChem, etc.) that bury all exceptions and run calculations against a system that may have hours of work on a queue. You do not want this :) Instead I would simplify your architecture by adding a bigchem: bool = False keyword argument to the functions (such as calc_cartensian_hessian) where you added a BigChem option. This will maintain your existing API while adding new, optional functionality. Users (including your own code) can explicitly declare when they want to use bigchem like this calc_cartensian_hessian(..., bigchem=True).
Remove checks for BigChem readiness. If the system is not available you WANT to raise those exceptions and report back to the end user why they were not able to use BigChem, expecidally since they now explicitly passed a bigchem = True argument either to the function call or the command line arguments.
Definitely remove your CheckBigChem checks from def __init__(...) functions. These calls would hang your program for potentially hours if there is work on the queue. You can assume BigChem is available, call it in your code, and if it is not available have it report back to end users the exception to the end user explaining why. This is the way 🙌

The mental model you should have for BigChem is simply that they are functions you can call to get back results. In a sense, it is akin to an engine in your system, very much like the QCSchema engine. You can ask BigChem for energy, gradient, or (parallel) hessian objects, just like any other engine, and you can tell this engine what subprogram (i.e., terachem, psi4) to use. The difference is you can ask for many results in parallel (like the hessian or NEB beads).

This may seem like a lot of feedback but I think we can eliminate 80% of the code you've written and have a cleaner, easier to use, easier to understand implementation with a few architectural modifications :)

Feel free to ask any question for clarification!

coltonbh · 2024-04-12T16:46:49Z

geometric/nifty.py

+#==============================#
+#|        BigChem stuff       |#
+#==============================#
+
+BIGCHEM = False
+
+def CheckBigChem(engine):
+    global BIGCHEM
+    try:
+        from bigchem import compute
+        from qcio import Molecule as qcio_Molecule, ProgramInput
+
+        molecule = qcio_Molecule(
+            symbols=["O", "H", "H"],
+            geometry=[
+                [0.0, 0.0, 0.0],
+                [0.52421003, 1.68733646, 0.48074633],
+                [1.14668581, -0.45032174, -1.35474466],
+            ],
+        )
+
+        prog_input = ProgramInput(molecule=molecule,
+                                  calctype='energy',
+                                  model={"method":'hf', "basis":'3-21g'})
+        output = compute.delay(engine, prog_input)
+
+        BIGCHEM = output.get().success
+
+    except:
+        BIGCHEM = False
+        logger.warning("BigChem/%s is not working properly. Calculations will be carried linearly." %engine)
+
+def BigChemReady():
+    global BIGCHEM
+    return BIGCHEM
+
+def BigChemOff():
+    global BIGCHEM
+    BIGCHEM = False
+


I would not check BigChem with a calculation. The queue may have many calculations waiting and this computation could wait for hours, for example.

What is it you need to check? Perhaps I can help you think of a better way to implement it. You can simply try to run a calculation and if BigChem isn't running it will raise an exception because it will not be able to connect to Redis (your broker) to deliver the message. This error message will inform the end user about the issue that needs fixing instead of burying it with no output, as this function does here.

Burying exceptions with naked except: statements is considered bad practice because it hides from the end user the cause of the problem and the won't know how to take action to fix it :)

Also, in Python the correct way to define function is with def snake_case and not def CaptialCase.

Also, rather than having these global "switches" I would add bigchem: bool = False to the functions where it might be used so that end users can easily use/not use BigChem in an explicit way, rather than hiding it's use in a buried global value.

Hi @coltonbh, I replaced the global BigChem switch with the function you suggested. It would be great if you could review the changes. Thank you!

coltonbh · 2024-04-12T16:50:45Z

geometric/normal_modes.py

+
+    elif BigChemReady():
+        # If BigChem is ready, it will be used to parallelize the Hessian calculation.
+        logger.info("BigChem will be used to calculate the Hessian. \n")
+        from qcio import Molecule as qcio_Molecule, ProgramInput
+        from bigchem import compute, group
+
+        elems = molecule[0].elem
+
+        # Uncommenting the following 6 lines and commenting out the rest of the lines will use BigChem's parallel_hessian function.
+        #from bigchem.algos import parallel_hessian
+        #qcio_M = qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3))
+        #input = ProgramInput(molecule=qcio_M, model={"method":engine.method, "basis": engine.basis}, calctype='hessian')
+        #output = parallel_hessian("psi4", input).delay()
+        #rec = output.get()
+        #Hx = rec.results.hessian
+
+        # Creating a list containing qcio Molecule obejcts with different geometries.
+        molecules = []
+        for i in range(nc):
+            coords[i] += h
+            molecules.append(qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3)))
+            coords[i] -= 2*h
+            molecules.append(qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3)))
+            coords[i] += h
+
+        # Submitting calculations
+        outputs = group(compute.s(engine.__class__.__name__.lower(),
+                        ProgramInput(molecule=qcio_M, calctype='gradient',
+                                     model={"method":engine.method, "basis": engine.basis},
+                                     extras={"order":i}),
+                        ) for i, qcio_M in enumerate(molecules)).apply_async()
+
+        # Getting the records
+        records = outputs.get()
+        assert len(records) == nc*2
+
+        # Grouping the recrods
+        grouped_records = list(zip(records[::2], records[1::2]))
+
+        # Iterating through the grouped records to calculate the Hessian
+        for i, (fwd_rec, bak_rec) in enumerate(grouped_records):
+            # Double checking the order
+            assert fwd_rec.input_data.extras["order"] == i*2
+            assert bak_rec.input_data.extras["order"] == fwd_rec.input_data.extras["order"] + 1
+            gfwd = fwd_rec.results.gradient.ravel()
+            gbak = bak_rec.results.gradient.ravel()
+            Hx[i] = (gfwd-gbak)/(2*h)
+
+        # Deleting the records in the backend
+        outputs.forget()
+


Rather than using BigChem without the end user knowing, which can lead to surprising behavior, errors, and code paths, I would instead recommend adding a new keyword argument bigchem: bool = False to calc_cartesian_hessian and then have an if bigchem: ` statement for this code block. Also, you can just use from bigchem import parallel_hessian and use it in place of your hand coded hessian assembly and then this would be a one-liner :)

coltonbh · 2024-04-12T16:55:47Z

geometric/params.py

+        # Check BigChem.
+        if kwargs.get('bigchem', False):
+            CheckBigChem(kwargs.get('engine', 'terachem'))


I would strongly recommend removing this from your def __init__(...) method. This CheckBigChem function is executing an ab initio calculation against a distributed system that may have hundreds (or thousands!) of items on a job queue. Even without anything on the queue this will add at least a few seconds of overhead to a simple instantiation of the OptParams object.

coltonbh · 2024-04-12T16:58:01Z

geometric/params.py

+        if kwargs.get('bigchem', False):
+            CheckBigChem(kwargs.get('engine', 'terachem'))


Ditto to the comment above. This will add seconds to hours of delay time to the instantiation of this object!! Do not do this! You do not need to check that BigChem is "ready" by performing a calculation against the system--this could results in hours of waiting if the queue is very full--you do not want this. BigChem can simply be trusted to be running and if someone tries a BigChem calculation without the system being on, it will raise the correct exception (telling the end user that the code cannot connect to the BigChem system and why).

coltonbh · 2024-04-12T16:59:25Z

geometric/params.py

-    grp_hessian.add_argument('--port', type=int, help='Work Queue port used to distribute Hessian calculations. Workers must be started separately.')
-    grp_hessian.add_argument('--frequency', type=str2bool, help='Perform frequency analysis whenever Hessian is calculated, default is yes/true.\n ')
+    grp_hessian.add_argument('--wqport', type=int, help='Work Queue port used to distribute Hessian calculations. Workers must be started separately. \n ')
+    grp_hessian.add_argument('--bigchem', type=str2bool, help='Provide "Yes" to use BigChem for performing the Hessian calculation in parallel. \n'


I'd have this flag communicate to your calc_cartesian_hessian function by passing bigchem=True, after you change the implementation of the calc_cartensian_hessian function.

coltonbh · 2024-04-12T17:17:48Z

geometric/tests/test_neb.py

+    # Turning off BigChem
+    geometric.nifty.BigChemOff()
+    assert not geometric.nifty.BigChemReady()


coltonbh · 2024-04-12T17:17:57Z

geometric/tests/test_neb.py

+    # Turning on BigChem
+    geometric.nifty.CheckBigChem('qchem')
+    assert geometric.nifty.BigChemReady()


coltonbh · 2024-04-12T17:18:06Z

geometric/tests/test_neb.py

+    # Turning off BigChem
+    geometric.nifty.BigChemOff()
+    assert not geometric.nifty.BigChemReady()


coltonbh · 2024-04-12T17:18:12Z

geometric/tests/test_neb.py

+    # Turning on BigChem
+    geometric.nifty.CheckBigChem('terachem')
+    assert geometric.nifty.BigChemReady()


coltonbh · 2024-04-12T17:18:23Z

geometric/tests/test_neb.py

+    # Turning off BigChem
+    geometric.nifty.BigChemOff()
+    assert not geometric.nifty.BigChemReady()


coltonbh

This looks dramatically better! Very nice!

I've added a note about using pytest fixtures so you can DRY up your test code and reduce the code you're adding while making your tests easier to reason about. You can extend this fixture pattern to your other tests as well (in a separate PR) and make your whole tests suite much smaller.

This is already a huge improvement though! We've cut out over 100 lines of code and have something much easier to reason about.

At this point I'd say your integration with BigChem is much cleaner and easier to understand. Up to you if you'd like to clean up your tests with fixtures. It's a good pattern to learn and will make writing tests much faster and easier.

You need to add bigchem to your setup.py file. If you don't want to install bigchem by default, add it to extra_requires. So either:

setup(
        ....
        'numpy',  # Main dependencies that are always installed
        'bigchem', # Either here
    ],
    extras_require={
        'bigchem': ['bigchem']  # Or here (Optional dependencies)
    }
)

Nice work 🙌

P.S. If you add it to extra_require then you would install it with pip install geometric[bigchem]. That's what those [someextras] notation means on install--also install these extras. The array can contain a whole list of packages so for example if you had some visualization component to geometric and you want it to be optionally installable you could list all the packages required for that like this:

    ...
    extra_require={
        'visualization': ['pkg1', 'pkg2', 'pkg3']
}

And then if you wanted that visualization feature you'd do pip install geometric[visualization].

coltonbh · 2024-04-12T17:35:39Z

geometric/tests/addons.py

+using_bigchem_psi4 = pytest.mark.skipif(
+    (not bigchem_found("psi4")), reason="BigChem/Psi4 is not working. please install the package to enable tests")
+using_bigchem_qchem = pytest.mark.skipif(
+    (not bigchem_found("qchem")), reason="BigChem/QChem is not working. please install the package to enable tests")
+using_bigchem_terachem = pytest.mark.skipif(
+    (not bigchem_found("terachem")), reason="BigChem/TeraChem is not working. please install the package to enable tests")


Adding to review!

coltonbh · 2024-04-13T20:58:49Z

geometric/tests/addons.py

+def bigchem_found(engine):
+    geometric.nifty.CheckBigChem(engine)
+    found = geometric.nifty.BigChemReady()
+    geometric.nifty.BigChemOff()
+    return found
+


Rather than perform checks, you may just want to use pytest's built in mark feature

@pytest.mark.bigchem # or maybe @pytest.mark.integration for all tests that utilize external programs def my_bigchem_tests(...): ...

Then you can register these markets in your pyproject.toml file (or wherever you keep your pytest configuration:

[pytest] markers = slow: mark a test as slow running integration: mark test as integration test

And then control which tests run with

pytest -m 'not integration`

or you can deselect integration test by default by updating your pytest configuration with something like this:

[tool.pytest.ini_options] addopts = "-m 'not integration'" markers = [ "integration: marks tests as integration (deselect with '-m \"not integration\"')", ]

Just a few additional ways to achieve the same outcome without needing to run checks over and over to see if a an program is available :)

coltonbh · 2024-04-16T17:59:06Z

geometric/tests/test_neb.py

+    M, engine = geometric.prepare.get_molecule_engine(
+        input=os.path.join(datad, "hcn_neb_input.psi4in"),
+        chain_coords=os.path.join(datad, "hcn_neb_input.xyz"),
+        images=11,
+        neb=True,
+        engine="psi4",
+    )


For pieces of code like this that are repeated boilerplate for your tests, I would recommend a pytest fixture. In your conftest.py file (should live in top level of tests directory):

@pytest.fixture def molecule_engine_neb(): """Return the Molecule and Engine for an NEB Calculation.""" def select_engine(engine: str): return geometric.prepare.get_molecule_engine( input=os.path.join(datad, "hcn_neb_input.psi4in"), chain_coords=os.path.join(datad, "hcn_neb_input.xyz"), images=11, neb=True, engine=engine, ) return create_prog_input

Then you can use it in your function like this. Pytest will automatically inject the fixture as an argument at runtime:

@addons.using_psi4 @addons.using_bigchem def test_psi4_bigchem(localizer, molecule_engine_neb): M, engine = molecule_engine_neb('psi4') # Continue with test

This helps reduce all the repeated boilerplate code in your test suite (DRY's up your code!) and lets your focus on what you want actually test rather than adding hundreds of lines of repeated code.

You can generalize a fixture like this further for other use cases or create other fixtures for repeated test prep boilerplate :) I'm mentioning this because it's a bit concerning to add ~500 lines of code for a few simple function calls in BigChem! That is a lot of code to have to maintain in the future.

Hi @coltonbh, Thank you for teaching me about pytest.fixture. I was able to remove repetitive lines in the test scripts.

coltonbh · 2024-04-16T18:01:48Z

geometric/tests/test_hessian.py

+def test_psi4_bigchem_hessian():
+    molecule, engine = geometric.prepare.get_molecule_engine(engine="psi4", input=os.path.join(datad, 'hcn_minimized.psi4in'))
+    coords = molecule.xyzs[0].flatten()*geometric.nifty.ang2bohr
+    hessian = geometric.normal_modes.calc_cartesian_hessian(coords, molecule, engine, tempfile.mkdtemp())
+    freqs, modes, G = geometric.normal_modes.frequency_analysis(coords, hessian, elem=molecule.elem, verbose=True)


This is another highly repeated section of code that is copy-and-pasted over and over again in the tests. I'd refactor this into a fixture and/or parent function to reduce replication. When we say "DRY your code" we mean "don't repeat yourself." So now you have something that works, go back and refactor repeated sections into shared functions or pytest fixtures (depending on how you want to use the function) so you have less code to maintain and it's easier to understand :)

coltonbh · 2024-04-16T18:09:10Z

geometric/neb.py

+        elif self.params.bigchem:
+            # If BigChem is available, it will be used to carry the single-point calculations.
+            from qcio import Molecule as qcio_Molecule, ProgramInput
+            from bigchem import compute, group
+            elems = self.Structures[0].M.elem
+            molecules = []
+
+            # Creating a list with qcio Molecule objects and submitting calculations.
+            for Structure in self.Structures:
+                molecules.append(qcio_Molecule(symbols=elems, geometry=Structure.cartesians.reshape(-1,3)))
+
+            outputs = group(compute.s(self.engine.__class__.__name__.lower(),
+                            ProgramInput(molecule=qcio_M, calctype="gradient",
+                                         model={"method":self.engine.method, "basis": self.engine.basis},
+                                         extras={"order":i}),
+                            ) for i, qcio_M in enumerate(molecules)).apply_async()
+
+            # Getting the records
+            records = outputs.get()
+
+            # Passing the results to chain.ComputeEnergyGradient()
+            for i in range(len(self)):
+                assert records[i].input_data.extras["order"] == i
+                result = {"energy": records[i].results.energy, "gradient": np.array(records[i].results.gradient).ravel()}
+                self.Structures[i].ComputeEnergyGradient(result=result)
+
+            # Deleting the records
+            outputs.forget()


You can do this instead if you want to make it simpler:

from bigchem import parallel_hessian pi = ProgramInput(molecule=qcio_m, calctype='hessian', model={...}, extras={...}) fr = parallel_hessian(self.engine.basis, pi).delay() output = fr.get() fr.forget() hessian = output.results.hessian

No worries if you want to keep your own implementation. You can see my implementation of this here.

This part is for NEB, which won't need the Hessian calculation.

Ah, I see! I mistakenly thought it was a hessian :)

… test_hessian.py is updated.

… the broker connection.

… been removed.

coltonbh reviewed Apr 12, 2024

View reviewed changes

coltonbh reviewed Apr 16, 2024

View reviewed changes

hjnpark added 11 commits May 13, 2024 14:15

Implementing BigChem to enable parallelization of Hessian calculation

d7e9661

BigChem v0.6.7 can be used for parallelizing the Hessian calculation.…

8e12423

… test_hessian.py is updated.

BigChem can be used for NEB. Tests for BigChem were added.

a1b4d0f

minor fix

e271601

BigChem/TeraChem tests were added

c7294c0

minor changes

59f0d0e

The global BigChem switch was replaced with a function that can check…

2ebad68

… the broker connection.

minor fix

16a1380

more fix for tests

9f1fafe

The pytest fixture was used to remove repetitive lines in test scripts.

fcadfd2

rebased and hessian_reset keyword for QCF/NEB and guessk for NEB have…

66b08a5

… been removed.

hjnpark force-pushed the master branch from bb1c128 to 66b08a5 Compare May 13, 2024 21:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BigChem implementation #199

BigChem implementation #199

hjnpark commented Apr 10, 2024 •

edited

leeping commented Apr 11, 2024

coltonbh commented Apr 11, 2024

coltonbh left a comment •

edited

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

hjnpark Apr 15, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh Apr 12, 2024

coltonbh left a comment •

edited

coltonbh Apr 12, 2024

coltonbh Apr 13, 2024

coltonbh Apr 16, 2024

hjnpark Apr 17, 2024 •

edited

coltonbh Apr 16, 2024

coltonbh Apr 16, 2024

hjnpark Apr 17, 2024

coltonbh Apr 17, 2024

		if kwargs.get('bigchem', False):
		CheckBigChem(kwargs.get('engine', 'terachem'))

BigChem implementation #199

Are you sure you want to change the base?

BigChem implementation #199

Conversation

hjnpark commented Apr 10, 2024 • edited

leeping commented Apr 11, 2024

coltonbh commented Apr 11, 2024

coltonbh left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coltonbh left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjnpark Apr 17, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjnpark commented Apr 10, 2024 •

edited

coltonbh left a comment •

edited

coltonbh left a comment •

edited

hjnpark Apr 17, 2024 •

edited