Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigChem implementation #199

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

BigChem implementation #199

wants to merge 11 commits into from

Conversation

hjnpark
Copy link
Collaborator

@hjnpark hjnpark commented Apr 10, 2024

geomeTRIC will check to see if BigChem and a given engine are working properly when --bigchem yes argument is provided. If BigChem is ready, the Hessian and NEB calculation will be carried in parallel via BigChem. Currently TeraChem, Psi4, and QChem are available.

Here are the changes:

  1. --port changed to --wqport.
  2. method and basis attributes were added to the engines.
  3. number_output function was added to Psi4 engine.
  4. calc_new function in QCEngineAPI was modified to be compatible with TeraChem (this change is for QCEngine).
  5. BigChem related functions were added in nifty.py.
  6. Minor changes were made to docstrings and contributors' information in the scripts.
  7. Tests were added.

@leeping
Copy link
Owner

leeping commented Apr 11, 2024

I reviewed the code and it looks great; @coltonbh could you please take a quick look?

@coltonbh
Copy link
Contributor

Hi @leeping and @hjnpark. Got tied up today. Will review first thing tomorrow!

Copy link
Contributor

@coltonbh coltonbh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level changes I would suggest:

  1. Do not use the global switches turning BigChem on/off. This will lead to great unhappiness! You've hidden when the system will/won't use BigChem and why inside of checking functions (CheckBigChem, etc.) that bury all exceptions and run calculations against a system that may have hours of work on a queue. You do not want this :) Instead I would simplify your architecture by adding a bigchem: bool = False keyword argument to the functions (such as calc_cartensian_hessian) where you added a BigChem option. This will maintain your existing API while adding new, optional functionality. Users (including your own code) can explicitly declare when they want to use bigchem like this calc_cartensian_hessian(..., bigchem=True).
  2. Remove checks for BigChem readiness. If the system is not available you WANT to raise those exceptions and report back to the end user why they were not able to use BigChem, expecidally since they now explicitly passed a bigchem = True argument either to the function call or the command line arguments.
  3. Definitely remove your CheckBigChem checks from def __init__(...) functions. These calls would hang your program for potentially hours if there is work on the queue. You can assume BigChem is available, call it in your code, and if it is not available have it report back to end users the exception to the end user explaining why. This is the way 🙌

The mental model you should have for BigChem is simply that they are functions you can call to get back results. In a sense, it is akin to an engine in your system, very much like the QCSchema engine. You can ask BigChem for energy, gradient, or (parallel) hessian objects, just like any other engine, and you can tell this engine what subprogram (i.e., terachem, psi4) to use. The difference is you can ask for many results in parallel (like the hessian or NEB beads).

This may seem like a lot of feedback but I think we can eliminate 80% of the code you've written and have a cleaner, easier to use, easier to understand implementation with a few architectural modifications :)

Feel free to ask any question for clarification!

Comment on lines 802 to 841
#==============================#
#| BigChem stuff |#
#==============================#

BIGCHEM = False

def CheckBigChem(engine):
global BIGCHEM
try:
from bigchem import compute
from qcio import Molecule as qcio_Molecule, ProgramInput

molecule = qcio_Molecule(
symbols=["O", "H", "H"],
geometry=[
[0.0, 0.0, 0.0],
[0.52421003, 1.68733646, 0.48074633],
[1.14668581, -0.45032174, -1.35474466],
],
)

prog_input = ProgramInput(molecule=molecule,
calctype='energy',
model={"method":'hf', "basis":'3-21g'})
output = compute.delay(engine, prog_input)

BIGCHEM = output.get().success

except:
BIGCHEM = False
logger.warning("BigChem/%s is not working properly. Calculations will be carried linearly." %engine)

def BigChemReady():
global BIGCHEM
return BIGCHEM

def BigChemOff():
global BIGCHEM
BIGCHEM = False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not check BigChem with a calculation. The queue may have many calculations waiting and this computation could wait for hours, for example.

What is it you need to check? Perhaps I can help you think of a better way to implement it. You can simply try to run a calculation and if BigChem isn't running it will raise an exception because it will not be able to connect to Redis (your broker) to deliver the message. This error message will inform the end user about the issue that needs fixing instead of burying it with no output, as this function does here.

Burying exceptions with naked except: statements is considered bad practice because it hides from the end user the cause of the problem and the won't know how to take action to fix it :)

Also, in Python the correct way to define function is with def snake_case and not def CaptialCase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, rather than having these global "switches" I would add bigchem: bool = False to the functions where it might be used so that end users can easily use/not use BigChem in an explicit way, rather than hiding it's use in a buried global value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @coltonbh, I replaced the global BigChem switch with the function you suggested. It would be great if you could review the changes. Thank you!

Comment on lines 137 to 188

elif BigChemReady():
# If BigChem is ready, it will be used to parallelize the Hessian calculation.
logger.info("BigChem will be used to calculate the Hessian. \n")
from qcio import Molecule as qcio_Molecule, ProgramInput
from bigchem import compute, group

elems = molecule[0].elem

# Uncommenting the following 6 lines and commenting out the rest of the lines will use BigChem's parallel_hessian function.
#from bigchem.algos import parallel_hessian
#qcio_M = qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3))
#input = ProgramInput(molecule=qcio_M, model={"method":engine.method, "basis": engine.basis}, calctype='hessian')
#output = parallel_hessian("psi4", input).delay()
#rec = output.get()
#Hx = rec.results.hessian

# Creating a list containing qcio Molecule obejcts with different geometries.
molecules = []
for i in range(nc):
coords[i] += h
molecules.append(qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3)))
coords[i] -= 2*h
molecules.append(qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3)))
coords[i] += h

# Submitting calculations
outputs = group(compute.s(engine.__class__.__name__.lower(),
ProgramInput(molecule=qcio_M, calctype='gradient',
model={"method":engine.method, "basis": engine.basis},
extras={"order":i}),
) for i, qcio_M in enumerate(molecules)).apply_async()

# Getting the records
records = outputs.get()
assert len(records) == nc*2

# Grouping the recrods
grouped_records = list(zip(records[::2], records[1::2]))

# Iterating through the grouped records to calculate the Hessian
for i, (fwd_rec, bak_rec) in enumerate(grouped_records):
# Double checking the order
assert fwd_rec.input_data.extras["order"] == i*2
assert bak_rec.input_data.extras["order"] == fwd_rec.input_data.extras["order"] + 1
gfwd = fwd_rec.results.gradient.ravel()
gbak = bak_rec.results.gradient.ravel()
Hx[i] = (gfwd-gbak)/(2*h)

# Deleting the records in the backend
outputs.forget()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using BigChem without the end user knowing, which can lead to surprising behavior, errors, and code paths, I would instead recommend adding a new keyword argument bigchem: bool = False to calc_cartesian_hessian and then have an if bigchem: ` statement for this code block. Also, you can just use from bigchem import parallel_hessian and use it in place of your hand coded hessian assembly and then this would be a one-liner :)

Comment on lines 110 to 112
# Check BigChem.
if kwargs.get('bigchem', False):
CheckBigChem(kwargs.get('engine', 'terachem'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would strongly recommend removing this from your def __init__(...) method. This CheckBigChem function is executing an ab initio calculation against a distributed system that may have hundreds (or thousands!) of items on a job queue. Even without anything on the queue this will add at least a few seconds of overhead to a simple instantiation of the OptParams object.

Comment on lines 257 to 258
if kwargs.get('bigchem', False):
CheckBigChem(kwargs.get('engine', 'terachem'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto to the comment above. This will add seconds to hours of delay time to the instantiation of this object!! Do not do this! You do not need to check that BigChem is "ready" by performing a calculation against the system--this could results in hours of waiting if the queue is very full--you do not want this. BigChem can simply be trusted to be running and if someone tries a BigChem calculation without the system being on, it will raise the correct exception (telling the end user that the code cannot connect to the BigChem system and why).

grp_hessian.add_argument('--port', type=int, help='Work Queue port used to distribute Hessian calculations. Workers must be started separately.')
grp_hessian.add_argument('--frequency', type=str2bool, help='Perform frequency analysis whenever Hessian is calculated, default is yes/true.\n ')
grp_hessian.add_argument('--wqport', type=int, help='Work Queue port used to distribute Hessian calculations. Workers must be started separately. \n ')
grp_hessian.add_argument('--bigchem', type=str2bool, help='Provide "Yes" to use BigChem for performing the Hessian calculation in parallel. \n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have this flag communicate to your calc_cartesian_hessian function by passing bigchem=True, after you change the implementation of the calc_cartensian_hessian function.

Comment on lines 135 to 137
# Turning off BigChem
geometric.nifty.BigChemOff()
assert not geometric.nifty.BigChemReady()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 145 to 147
# Turning on BigChem
geometric.nifty.CheckBigChem('qchem')
assert geometric.nifty.BigChemReady()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 170 to 172
# Turning off BigChem
geometric.nifty.BigChemOff()
assert not geometric.nifty.BigChemReady()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 180 to 182
# Turning on BigChem
geometric.nifty.CheckBigChem('terachem')
assert geometric.nifty.BigChemReady()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 210 to 212
# Turning off BigChem
geometric.nifty.BigChemOff()
assert not geometric.nifty.BigChemReady()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor

@coltonbh coltonbh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks dramatically better! Very nice!

I've added a note about using pytest fixtures so you can DRY up your test code and reduce the code you're adding while making your tests easier to reason about. You can extend this fixture pattern to your other tests as well (in a separate PR) and make your whole tests suite much smaller.

This is already a huge improvement though! We've cut out over 100 lines of code and have something much easier to reason about.

At this point I'd say your integration with BigChem is much cleaner and easier to understand. Up to you if you'd like to clean up your tests with fixtures. It's a good pattern to learn and will make writing tests much faster and easier.

You need to add bigchem to your setup.py file. If you don't want to install bigchem by default, add it to extra_requires. So either:

setup(
        ....
        'numpy',  # Main dependencies that are always installed
        'bigchem', # Either here
    ],
    extras_require={
        'bigchem': ['bigchem']  # Or here (Optional dependencies)
    }
)

Nice work 🙌

P.S. If you add it to extra_require then you would install it with pip install geometric[bigchem]. That's what those [someextras] notation means on install--also install these extras. The array can contain a whole list of packages so for example if you had some visualization component to geometric and you want it to be optionally installable you could list all the packages required for that like this:

    ...
    extra_require={
        'visualization': ['pkg1', 'pkg2', 'pkg3']
}

And then if you wanted that visualization feature you'd do pip install geometric[visualization].

Comment on lines 72 to 77
using_bigchem_psi4 = pytest.mark.skipif(
(not bigchem_found("psi4")), reason="BigChem/Psi4 is not working. please install the package to enable tests")
using_bigchem_qchem = pytest.mark.skipif(
(not bigchem_found("qchem")), reason="BigChem/QChem is not working. please install the package to enable tests")
using_bigchem_terachem = pytest.mark.skipif(
(not bigchem_found("terachem")), reason="BigChem/TeraChem is not working. please install the package to enable tests")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to review!

Comment on lines 49 to 62
def bigchem_found(engine):
geometric.nifty.CheckBigChem(engine)
found = geometric.nifty.BigChemReady()
geometric.nifty.BigChemOff()
return found

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than perform checks, you may just want to use pytest's built in mark feature

@pytest.mark.bigchem # or maybe @pytest.mark.integration for all tests that utilize external programs
def my_bigchem_tests(...):
    ...

Then you can register these markets in your pyproject.toml file (or wherever you keep your pytest configuration:

[pytest]
markers =
    slow: mark a test as slow running
    integration: mark test as integration test

And then control which tests run with

pytest -m 'not integration`

or you can deselect integration test by default by updating your pytest configuration with something like this:

[tool.pytest.ini_options]
addopts = "-m 'not integration'"
markers = [
    "integration: marks tests as integration (deselect with '-m \"not integration\"')",
]

Just a few additional ways to achieve the same outcome without needing to run checks over and over to see if a an program is available :)

Comment on lines 109 to 115
M, engine = geometric.prepare.get_molecule_engine(
input=os.path.join(datad, "hcn_neb_input.psi4in"),
chain_coords=os.path.join(datad, "hcn_neb_input.xyz"),
images=11,
neb=True,
engine="psi4",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For pieces of code like this that are repeated boilerplate for your tests, I would recommend a pytest fixture. In your conftest.py file (should live in top level of tests directory):

@pytest.fixture
def molecule_engine_neb():
"""Return the Molecule and Engine for an NEB Calculation."""
    def select_engine(engine: str):
        return geometric.prepare.get_molecule_engine(
            input=os.path.join(datad, "hcn_neb_input.psi4in"),
            chain_coords=os.path.join(datad, "hcn_neb_input.xyz"),
            images=11,
            neb=True,
            engine=engine,
    )

    return create_prog_input

Then you can use it in your function like this. Pytest will automatically inject the fixture as an argument at runtime:

@addons.using_psi4
@addons.using_bigchem
def test_psi4_bigchem(localizer, molecule_engine_neb):
    M, engine = molecule_engine_neb('psi4')
    # Continue with test

This helps reduce all the repeated boilerplate code in your test suite (DRY's up your code!) and lets your focus on what you want actually test rather than adding hundreds of lines of repeated code.

You can generalize a fixture like this further for other use cases or create other fixtures for repeated test prep boilerplate :) I'm mentioning this because it's a bit concerning to add ~500 lines of code for a few simple function calls in BigChem! That is a lot of code to have to maintain in the future.

Copy link
Collaborator Author

@hjnpark hjnpark Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @coltonbh, Thank you for teaching me about pytest.fixture. I was able to remove repetitive lines in the test scripts.

Comment on lines 49 to 53
def test_psi4_bigchem_hessian():
molecule, engine = geometric.prepare.get_molecule_engine(engine="psi4", input=os.path.join(datad, 'hcn_minimized.psi4in'))
coords = molecule.xyzs[0].flatten()*geometric.nifty.ang2bohr
hessian = geometric.normal_modes.calc_cartesian_hessian(coords, molecule, engine, tempfile.mkdtemp())
freqs, modes, G = geometric.normal_modes.frequency_analysis(coords, hessian, elem=molecule.elem, verbose=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another highly repeated section of code that is copy-and-pasted over and over again in the tests. I'd refactor this into a fixture and/or parent function to reduce replication. When we say "DRY your code" we mean "don't repeat yourself." So now you have something that works, go back and refactor repeated sections into shared functions or pytest fixtures (depending on how you want to use the function) so you have less code to maintain and it's easier to understand :)

Comment on lines +302 to +329
elif self.params.bigchem:
# If BigChem is available, it will be used to carry the single-point calculations.
from qcio import Molecule as qcio_Molecule, ProgramInput
from bigchem import compute, group
elems = self.Structures[0].M.elem
molecules = []

# Creating a list with qcio Molecule objects and submitting calculations.
for Structure in self.Structures:
molecules.append(qcio_Molecule(symbols=elems, geometry=Structure.cartesians.reshape(-1,3)))

outputs = group(compute.s(self.engine.__class__.__name__.lower(),
ProgramInput(molecule=qcio_M, calctype="gradient",
model={"method":self.engine.method, "basis": self.engine.basis},
extras={"order":i}),
) for i, qcio_M in enumerate(molecules)).apply_async()

# Getting the records
records = outputs.get()

# Passing the results to chain.ComputeEnergyGradient()
for i in range(len(self)):
assert records[i].input_data.extras["order"] == i
result = {"energy": records[i].results.energy, "gradient": np.array(records[i].results.gradient).ravel()}
self.Structures[i].ComputeEnergyGradient(result=result)

# Deleting the records
outputs.forget()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this instead if you want to make it simpler:

from bigchem import parallel_hessian

pi = ProgramInput(molecule=qcio_m, calctype='hessian', model={...}, extras={...})
fr = parallel_hessian(self.engine.basis, pi).delay()
output = fr.get()
fr.forget()
hessian = output.results.hessian

No worries if you want to keep your own implementation. You can see my implementation of this here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is for NEB, which won't need the Hessian calculation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see! I mistakenly thought it was a hessian :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants