PyFlow: A Generalized Program for Running Custom Sequences of Quantum Chemistry Calculations using Slurm
PyFlow is a program for building custom, modular, high-throughput quantum chemistry screening workflows to support the discovery of novel, sustainable materials. It lets you set up an automated workflow for computing ground- or excited-state molecular geometries and energies.
- Access to a high-performance computing (HPC) cluster
- Gaussian 16 and/or GAMESS version 2018-R1 or later
- Anaconda with Python 3.8+

To install PyFlow:

- Clone this GitHub repository into your home directory on the cluster: `git clone https://github.com/kuriba/PyFlow.git`
- Go into the newly created `PyFlow` directory and set up an Anaconda environment using the provided `environment.yml` file: `conda env create --file environment.yml`
- Activate the newly created environment: `conda activate pyflow`
- Run the following setup command within the `PyFlow` directory: `pip install -e .`
- Define the following environment variables in your `.bashrc`, replacing `/path/to/your/scratch/` with the actual path to your scratch directory: `export PYFLOW=~/PyFlow` and `export SCRATCH=/path/to/your/scratch/`
- If you intend to use GAMESS, create a directory called `scr` within your scratch directory.
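
Before running any workflows, it can help to confirm that the environment variables from the steps above are in place. The following is a minimal sketch, not part of PyFlow itself: the function name `check_pyflow_environment` and its arguments are hypothetical, and it only checks that `PYFLOW` and `SCRATCH` point to existing directories and, optionally, that the `scr` directory for GAMESS exists.

```python
import os
from pathlib import Path


def check_pyflow_environment(env=None, require_gamess_scr=False):
    """Return a list of problems found with the PYFLOW/SCRATCH setup.

    Hypothetical helper for illustration; PyFlow does not ship this function.
    """
    env = os.environ if env is None else env
    problems = []

    # Both variables should be defined and point to existing directories.
    for name in ("PYFLOW", "SCRATCH"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).expanduser().is_dir():
            problems.append(f"{name} points to a missing directory: {value}")

    # GAMESS users additionally need a "scr" directory inside SCRATCH.
    scratch = env.get("SCRATCH")
    if require_gamess_scr and scratch:
        scr_dir = Path(scratch).expanduser() / "scr"
        if not scr_dir.is_dir():
            problems.append(f"GAMESS scratch directory is missing: {scr_dir}")

    return problems


# Report any problems found in the current shell environment.
for problem in check_pyflow_environment(require_gamess_scr=True):
    print(problem)
```

An empty result means the checks passed; it does not verify that Gaussian 16 or GAMESS themselves are installed.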
The heart of workflow customizability is in the workflow configuration file. The workflow configuration file is a JSON-formatted file that defines the steps in a workflow and instructions for how to run each step. This file has some specific formatting requirements but has the general structure shown below.

    {
      "default": {
        "initial_step": "X",
        "steps": {
          "X": {
            "program": "gaussian16",
            "route": "#p pm7 opt",
            "opt": true,
            "conformers": true,
            "dependents": ["Y"]
          },
          "Y": { ... },
          "Z": { ... }
        }
      },
      "alt": { ... }
    }

The "default" key is a config_id that refers to a flow configuration. The second config_id, "alt", refers to another flow configuration. It's possible to define multiple flow configurations in a single file for the sake of organization. The desired workflow configuration can be selected at runtime. Each flow configuration is a JSON object that must specify two keys: initial_step and steps. The former declares the first step of the workflow, and the latter defines the specific instructions for running each step. Each step must define a JSON object of step parameters such as partition, memory, and time limits (see supported workflow step parameters below for an exhaustive list of supported step parameters). In the example configuration above, the defined steps are "X", "Y", and "Z".
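
The structural rules described above (a flow configuration needs `initial_step` and `steps`, and each entry in `dependents` should name a defined step) can be checked before a workflow is set up. The sketch below is an illustration only: `validate_flow_config` is a hypothetical helper, the contents of steps "Y" and "Z" are invented placeholders, and the breadth-first ordering is one plausible way to read the dependency chain, not necessarily how PyFlow schedules steps.

```python
import json
from collections import deque


def validate_flow_config(config, config_id="default"):
    """Check one flow configuration and return its step IDs in breadth-first order."""
    if config_id not in config:
        raise KeyError(f"config_id {config_id!r} not found")
    flow = config[config_id]

    # Every flow configuration must define these two keys.
    for key in ("initial_step", "steps"):
        if key not in flow:
            raise ValueError(f"flow configuration is missing {key!r}")

    steps = flow["steps"]
    initial = flow["initial_step"]
    if initial not in steps:
        raise ValueError(f"initial_step {initial!r} is not defined in steps")

    # Each dependent must refer to a step defined in the same configuration.
    for step_id, params in steps.items():
        for dependent in params.get("dependents", []):
            if dependent not in steps:
                raise ValueError(
                    f"step {step_id!r} lists undefined dependent {dependent!r}"
                )

    # Walk the dependency chain starting from the initial step.
    order, queue, seen = [], deque([initial]), {initial}
    while queue:
        step_id = queue.popleft()
        order.append(step_id)
        for dependent in steps[step_id].get("dependents", []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return order


# Placeholder configuration: the parameters of "Y" and "Z" are made up.
example = json.loads("""
{
  "default": {
    "initial_step": "X",
    "steps": {
      "X": {"program": "gaussian16", "route": "#p pm7 opt", "opt": true,
            "conformers": true, "dependents": ["Y"]},
      "Y": {"program": "gaussian16", "route": "#p pm7 opt", "dependents": ["Z"]},
      "Z": {"program": "gaussian16", "route": "#p pm7 opt"}
    }
  }
}
""")

print(validate_flow_config(example))  # ['X', 'Y', 'Z']
```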
The following step parameters are supported by all QC programs:

| Parameter | Description | Data type | Default |
|---|---|---|---|
| program | the QC program to use | str | none |
| opt | whether the step includes an optimization | bool | true |
| freq | whether the step includes a frequency calculation | bool | false |
| single_point | whether the step is a single-point calculation | bool | false |
| conformers | whether the step has conformers | bool | false |
| proceed_on_failed_conf | if true, allow molecules with failed conformers to proceed to the next step | bool | true |
| attempt_restart | whether to attempt to restart a calculation upon timeout or failure | bool | false |
| nproc | number of cores to request through Slurm | int | 14 |
| memory | amount of memory to request, in GB | int | 8 |
| time | the time limit for the calculation, in minutes | int | 1400 |
| time_padding | the time limit for processing/handling calculation outputs (the overall time limit for the Slurm submission is time + time_padding) | int | 5 |
| partition | the partition to request for the step | str | short |
| simul_jobs | the number of jobs to simultaneously run | int | 50 |
| save_outputs | whether to save the results of a step in /work/lopez/workflows | bool | false |
| dependents | a list of step IDs that are to be run after the completion of the current step | List[string] | [] |
| charge | the charge by which to increment all molecules | int | 0 |
| multiplicity | the multiplicity of the molecules | int | 1 |

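
To see how the defaults interact with the values given in a step, the sketch below fills in unspecified parameters from the table above and computes the overall Slurm time limit as time + time_padding. The names `STEP_DEFAULTS`, `resolve_step`, and `slurm_time_limit` are hypothetical and chosen for illustration; PyFlow's own handling of defaults may differ.

```python
import copy

# Defaults copied from the table above. "program" has no default and must be given.
STEP_DEFAULTS = {
    "opt": True,
    "freq": False,
    "single_point": False,
    "conformers": False,
    "proceed_on_failed_conf": True,
    "attempt_restart": False,
    "nproc": 14,
    "memory": 8,          # GB
    "time": 1400,         # minutes
    "time_padding": 5,    # minutes
    "partition": "short",
    "simul_jobs": 50,
    "save_outputs": False,
    "dependents": [],
    "charge": 0,
    "multiplicity": 1,
}


def resolve_step(params):
    """Return the step parameters with table defaults filled in."""
    if "program" not in params:
        raise ValueError("every step must specify 'program'")
    resolved = copy.deepcopy(STEP_DEFAULTS)  # avoid sharing the default list
    resolved.update(params)
    return resolved


def slurm_time_limit(step):
    """Overall Slurm time limit in minutes: time + time_padding."""
    return step["time"] + step["time_padding"]


step = resolve_step({"program": "gaussian16", "route": "#p pm7 opt", "time": 720})
print(step["nproc"], step["partition"])  # 14 short
print(slurm_time_limit(step))            # 725
```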
| QC program | Parameter | Description | Data type | Default |
|---|---|---|---|---|
| gaussian16 | route | the full route for the calculation | str | none |
| gaussian16 | rwf | whether to save the .rwf file | bool | false |
| gaussian16 | chk | whether to save the .chk file | bool | false |
| gamess | gbasis | Gaussian basis set specification | str | none |
| gamess | runtyp | the type of computation (e.g., energy, gradient, etc.) | str | none |
| gamess | dfttyp | DFT functional to use (ab initio if unspecified) | str | none |
| gamess | maxit | maximum number of SCF iteration cycles | int | 30 |
| gamess | opttol | gradient convergence tolerance, in Hartree/Bohr | float | 0.0001 |
| gamess | hess | selects the initial Hessian matrix | str | depends on runtyp (see GAMESS documentation) |
| gamess | nstep | maximum number of steps to take | int | 50 for minimum search, 20 for transition state search |
| gamess | idcver | the dispersion correction implementation to use | int | none |

*Note: refer to the documentation specific to each QC program for more details on valid arguments for each parameter.
It is possible to manually create a workflow configuration file in any text editor, but this places the burden of properly formatting the JSON file and including required step parameters on the user. To simplify the creation of properly-formatted workflow configuration files, the program includes the build_config utility for creating custom workflows via the command line. To access the utility, use the following command, replacing new_config.json and default with the desired configuration file name and configuration ID, respectively.

    pyflow build_config --config_file new_config.json --config_id default

You will see several prompts to enter step information and modify step parameters for your new workflow configuration. To add a workflow configuration with a new ID to an existing configuration file, provide the path to the existing file as the argument for --config_file.
Execution of a workflow is accomplished in three steps:
- Setting up a directory for the workflow
- Uploading molecules (as PDB files) to the unopt_pdbs folder of the workflow
- Submitting the workflow
To create a directory for your workflow, go to your scratch directory and run the following command, replacing my_first_workflow with your desired workflow name. The argument for the config_file flag should be the path to the desired workflow configuration file, and the argument for the config_id flag specifies which configuration to use from the specified configuration file.

    pyflow setup my_first_workflow --config_file /path/to/config/file --config_id "default"

Next, place the molecules for the workflow in the unopt_pdbs folder of the workflow directory that was created in the previous step. The structures should use the PDB format. The files should be named with the InChIKey of the molecule, followed by an underscore, followed by the conformer ID*, starting from 0.

    XXXXXXXXXXXXXX-YYYYYYYYYY-Z_0.pdb
    XXXXXXXXXXXXXX-YYYYYYYYYY-Z_1.pdb
    XXXXXXXXXXXXXX-YYYYYYYYYY-Z_2.pdb
    XXXXXXXXXXXXXX-YYYYYYYYYY-Z_3.pdb

*Note: If you only have one conformer for each molecule, the PDB files should each have the conformer ID "0".
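
Because the file names carry both the molecule identity and the conformer ID, a quick check before submission can catch naming mistakes. The sketch below is illustrative: `check_conformer_files` is a hypothetical helper, the pattern assumes standard InChIKeys (14, 10, and 1 uppercase letters separated by hyphens), and the requirement that conformer IDs be consecutive is an assumption that goes beyond the "starting from 0" rule stated above.

```python
import re
from collections import defaultdict

# <InChIKey>_<conformer ID>.pdb, with a standard 27-character InChIKey.
PDB_NAME = re.compile(r"^([A-Z]{14}-[A-Z]{10}-[A-Z])_(0|[1-9][0-9]*)\.pdb$")


def check_conformer_files(filenames):
    """Return a list of naming problems for the given PDB file names."""
    conformers = defaultdict(set)
    problems = []

    for name in filenames:
        match = PDB_NAME.match(name)
        if match is None:
            problems.append(f"{name}: expected <InChIKey>_<conformer ID>.pdb")
            continue
        conformers[match.group(1)].add(int(match.group(2)))

    # Assumption: IDs for each molecule run 0, 1, 2, ... without gaps.
    for inchikey, ids in sorted(conformers.items()):
        if ids != set(range(len(ids))):
            problems.append(
                f"{inchikey}: conformer IDs should run from 0 to {len(ids) - 1}"
            )
    return problems


names = [
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N_0.pdb",  # InChIKey of water, used as an example
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N_1.pdb",
    "water_0.pdb",                         # not an InChIKey, so it is reported
]
for problem in check_conformer_files(names):
    print(problem)
```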
To submit the workflow, run the following command while you're located in the workflow directory. This command will set up the input files for the first step using the initial coordinates from the structures in the unopt_pdbs folder, then submit them as an array.

    pyflow begin

The progress command is provided for easily monitoring the progress of a workflow. To use it, simply go to the directory of a running or completed workflow and execute the following command. This will output a small report on the overall progress of the calculations.

    pyflow progress

PyFlow's generators for Gaussian 16 input files and Slurm submission scripts are also accessible as standalone actions: g16 and sbatch, respectively. The examples below demonstrate how to use these utilities to generate files.
Generating a Gaussian 16 input file requires two arguments: a route and a geometry file (for the initial coordinates).
In this first example, a Gaussian 16 input file named file.com will be generated with the coordinates from file.pdb and the route "#p pm7 opt". This example uses default values for the charge (0), multiplicity (1), nproc (14), and memory (8 GB).

    pyflow g16 -r "#p pm7 opt" -g /path/to/geometry/file.pdb

It is possible to specify the charge, multiplicity, memory, and CPU allocation as follows.

    pyflow g16 -r "#p pm7 opt" -g /path/to/geometry/file.pdb --charge 1 --multiplicity 3 --memory 16 --nproc 16

The file generator attempts to determine the format of the initial geometry file based on its file ending (pdb in the examples above). If the file ending does not match a known Open Babel format, you can specify the format with the --geometry_format flag (refer to the Open Babel documentation for a complete list of supported formats).

    pyflow g16 -r "#p pm7 opt" -g /path/to/geometry/file.o --geometry_format xyz

Note: use pyflow g16 --help for an exhaustive list of options available for generating Gaussian 16 input files.
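
For orientation, the sketch below assembles the general layout of a Gaussian 16 input file (Link 0 lines, route, title, charge and multiplicity, coordinates, and a closing blank line) from the same pieces that the g16 action takes as arguments. It is an illustration of the file format under stated assumptions, not PyFlow's generator: `build_g16_input` is a hypothetical function, the title line and water coordinates are made up, and the exact text that PyFlow writes may differ.

```python
def build_g16_input(route, atoms, charge=0, multiplicity=1, nproc=14, memory=8,
                    title="generated input"):
    """Return the text of a simple Gaussian 16 input file.

    atoms is a list of (element symbol, x, y, z) tuples in Angstrom.
    """
    lines = [
        f"%nprocshared={nproc}",   # number of cores
        f"%mem={memory}GB",        # memory allocation
        route,                     # e.g. "#p pm7 opt"
        "",                        # blank line ends the route section
        title,
        "",                        # blank line ends the title section
        f"{charge} {multiplicity}",
    ]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2} {x:>12.6f} {y:>12.6f} {z:>12.6f}")
    lines.append("")               # blank line ends the molecule specification
    return "\n".join(lines) + "\n"


# Illustrative geometry for a water molecule (approximate coordinates).
water = [
    ("O", 0.000, 0.000, 0.117),
    ("H", 0.000, 0.757, -0.469),
    ("H", 0.000, -0.757, -0.469),
]
print(build_g16_input("#p pm7 opt", water))
```

The default values for charge, multiplicity, nproc, and memory mirror the defaults stated for the g16 action above.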
Generating Slurm submission scripts requires two arguments: a jobname and a file with commands to run.
In this example, a Slurm submission script named generic_slurm_job.sbatch will be generated with the commands in the commands.txt text file.

    pyflow sbatch -j generic_slurm_job -c /path/to/commands.txt

A number of arguments can be used to customize the Slurm submission script. In the example below, the partition, time limit (in minutes), number of nodes, and memory per node (in GB) are specified.

    pyflow sbatch -j another_generic_job -c /path/to/commands.txt --partition lopez --time 2880 --nodes 2 --memory 64

It is also possible to generate a submission script for an array with the --array flag. In the following example, a Slurm array submission script will be generated with 500 jobs in the array, limited to 40 simultaneously running jobs.

    pyflow sbatch -j generic_array_job -c /path/to/commands.txt --array 500 --simul_jobs 40

Note: use pyflow sbatch --help for an exhaustive list of options available for generating Slurm submission scripts.
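
To show roughly what such options correspond to in a Slurm script, the sketch below builds a submission script from a job name, a list of commands, and the options used in the examples above. It is a sketch under assumptions rather than PyFlow's actual output: `build_sbatch_script` and its default values are hypothetical, and the array is assumed to be indexed from 0. The `#SBATCH` directives themselves are standard Slurm options, including the `%` suffix on `--array` that caps the number of simultaneously running tasks.

```python
def build_sbatch_script(jobname, commands, partition="short", time=1400,
                        nodes=1, memory=8, array=None, simul_jobs=None):
    """Return the text of a Slurm submission script.

    time is in minutes and memory is in GB per node.
    """
    header = [
        "#!/bin/bash",
        f"#SBATCH --job-name={jobname}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={time}",      # a bare number is read as minutes
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --mem={memory}G",    # memory per node
    ]
    if array is not None:
        # Assumption: array tasks are numbered from 0 to array - 1.
        spec = f"0-{array - 1}"
        if simul_jobs is not None:
            spec += f"%{simul_jobs}"   # limit on simultaneously running tasks
        header.append(f"#SBATCH --array={spec}")
    return "\n".join(header + [""] + list(commands)) + "\n"


script = build_sbatch_script(
    "generic_array_job",
    ["echo running task $SLURM_ARRAY_TASK_ID"],
    array=500,
    simul_jobs=40,
)
print(script)
```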
Prof. Steven A. Lopez
Dr. Jordan Cox
Daniel Adrion
Fatemah Mukadum
Patrick Neal