Skip to content

Code to support the paper: A. Fabrizio, K. R. Briling, and C. Corminboeuf, “SPAHM: the Spectrum of Approximated Hamiltonian Matrices representations”, Digital Discovery, 2022, 1, 286–294

License

Notifications You must be signed in to change notification settings

lcmd-epfl/SPAHM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

76 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SPAHM

This code supports the paper

A. Fabrizio, K. R. Briling, and C. Corminboeuf,
“SPAHM: the Spectrum of Approximated Hamiltonian Matrices representations”
Digital Discovery, 2022, 1, 286–294

This is a collection of scripts which allow to reproduce the results presented in the paper.

Requirements

Usage

See workflow.md for scripts to reproduce the results of the paper.

0. Compute the target quantum-chemical properties [optional]

00. code/00_run_dft.py runs a DFT computation and saves the Fock and density matrices and occupied orbital values.

(click to see the command-line options)
usage: code/00_run_dft.py [-h] --mol FILENAME --basis BASIS [--charge CHARGE]
                          [--func FUNC] [--dir DIR]

  -h, --help       show this help message and exit
  --mol FILENAME   path to molecular structure in xyz format
  --basis BASIS    AO basis set
  --charge CHARGE  total charge of the system (default=0)
  --spin SPIN      number of unpaired electrons (default=None) 
                   (use 0 to treat a closed-shell system in a UHF manner)
  --func FUNC      DFT functional (default=PBE0)
  --dir DIR        directory to save the output in (default=current dir)

Example:

$ for i in examples/xyz/*.xyz ; do \
    code/00_run_dft.py --mol $i --basis ccpvdz --dir examples/dft/; \
  done

computes the Fock and density matrices for the structures in examples/xyz/ at the PBE0/cc-pVDZ level
and writes them to the examples/dft/ directory.

Note

Here we use a toy set of only 10 molecules

01. code/01_get_properties.py takes the output of the previous script and computes the target properties
(HOMO, HOMO–LUMO gap, and dipole moment).

(click to see the command-line options)
usage: code/01_get_properties.py [-h] --mol FILENAME --basis BASIS
                                 [--charge CHARGE] [--func FUNC] [--dir DIR]

  -h, --help       show this help message and exit
  --mol FILENAME   path to molecular structure in xyz format
  --basis BASIS    AO basis set
  --charge CHARGE  total charge of the system (default=0)
  --func FUNC      DFT functional (default=PBE0)
  --dir DIR        directory to read the input from (default=current dir)

Example:

$ for i in examples/xyz/*.xyz ; do \
    code/01_get_properties.py --mol $i --basis ccpvdz --dir examples/dft/; \
  done \
  | tee examples/prop.dat ; awk '{print $4}' examples/prop.dat > examples/dipole.dat

writes the dipole moments to examples/dipole.dat in the format recognizable by other scripts (plain list).

1. Compute the SPAHM representations

Warning

Use the file names that can be unambiguously sorted (for example, mol000, mol001, mol002, ... mol998, mol999). Other formats (e.g. without leading zeros) may lead to representation-property mismatch.

10. code/10_get_guess_eigen.py computes a SPAHM representation for a given molecule.

(click to see the command-line options)
usage: code/10_get_guess_eigen.py [-h] --mol FILENAME --guess GUESS [--basis BASIS]
                                  [--charge CHARGE] [--func FUNC] [--dir DIR]

  -h, --help       show this help message and exit
  --mol FILENAME   path to molecular structure in xyz format
  --guess GUESS    initial guess type
  --basis BASIS    AO basis set (default=MINAO)
  --charge CHARGE  total charge of the system (default=0)
  --func FUNC      DFT functional for the SAD guess (default=HF)
  --dir DIR        directory to save the output in (default=current dir)

Available guesses:  'core'   (the core Hamiltonian),
                    'gwh'    (generalized Wolfsberg–Helmholtz guess),
                    'huckel' (extended Hückel guess),
                    'sad'    (superposition of atomic densities),
                    'sap'    (superposition of atomic potentials),
                    'lb'     (Laikov–Briling SAP-like guess),
                    'lb-hfs' (Laikov–Briling SAP-like guess with HFS-based parameters)

Example:

$ for i in examples/xyz/*.xyz ; do \
    code/10_get_guess_eigen.py --mol $i --guess lb --dir examples/lb/ ; \
  done

computes the SPAHM representations based on the LB guess in the MINAO basis for the structures in examples/xyz/
and writes them to the examples/lb/ directory.

11. code/11_compile_repr_core_valence.py prepares the representations for regression:
pads them with zeros and merges into one file.

(click to see the command-line options)
usage: code/11_compile_repr_core_valence.py [-h] --eig EIG_DIRECTORY
                                            [--geom GEOM_DIRECTORY] [--split SPLIT]
                                            [--dir DIR]

  -h, --help              show this help message and exit
  --eig EIG_DIRECTORY     directory with eigenvalues
  --geom GEOM_DIRECTORY   directory with xyz files
  --split SPLIT           whether to split the core and valence energies or not (default=False)
  --dir DIR               directory to save the output in (default=current dir)

Example:

$ code/11_compile_repr_core_valence.py --eig examples/lb/ --dir examples/

writes the merged representations to examples/X_lb.npy.

Note

You can use the following script instead of the two above if no fine tuning is needed

12. code/12_get_guess_repr.py computes SPAHM representations for a set of molecules.

(click to see the command-line options)
usage: 12_get_guess_repr.py [-h] --geom GEOM_DIRECTORY --guess GUESS
                            [--basis BASIS] [--charge CHARGE] [--spin SPIN]
                            [--func FUNC] [--dir DIR]

This program computes the chosen initial guess for a set of molecules.

optional arguments:
  -h, --help            show this help message and exit
  --geom GEOM_DIRECTORY
                        directory with xyz files
  --guess GUESS         initial guess type
  --basis BASIS         AO basis set (default=MINAO)
  --charge CHARGE       file with a list of charges
  --spin SPIN           file with a list of numbers of unpaired electrons
  --func FUNC           DFT functional for the SAD guess (default=HF)
  --dir DIR             directory to save the output in (default=current dir)

Example:

$ code/12_get_guess_repr.py --geom examples/xyz/ --dir examples/ --guess lb

writes the representations to examples/X_lb.npy.

2. Optimize the hyperparameters

20. code/20_hyperparameters.py optimizes the σ (kernel width) and η (regularization) hyperparameters
using the k-fold cross valigation and grid search.

(click to see the command-line options)
usage: code/20_hyperparameters.py [-h] --x REPR --y PROP [--test TEST_SIZE]
                                  [--splits SPLITS] [--kernel KERNEL]

  -h, --help        show this help message and exit
  --x REPR          path to the representations file
  --y PROP          path to the properties file
  --test TEST_SIZE  test set fraction (default=0.2)
  --splits SPLITS   k in k-fold cross validation (default=5)
  --kernel KERNEL   kernel type (G for Gaussian, L for Laplacian, myL for Laplacian for open-shell systems) (default L)

Example:

$ code/20_hyperparameters.py --x examples/X_lb.npy --y examples/dipole.dat | tail -n 4

5.183039e-01 3.005078e-01 | 1.000000e-05 31.622777
5.182629e-01 3.004739e-01 | 3.162278e-08 31.622777
5.182628e-01 3.004737e-01 | 1.000000e-10 31.622777
5.105925e-01 3.382477e-01 | 1.000000e+00 31.622777

performs a grid search for the optimal hyperparameters and prints a sorted list of prediction errors, their standard deviations, and the corresponding σ and η parameters. Here the optimal parameters are: σ = 32, η = 1e-10.

3. Compute the learning curve

30. code/30_regression.py computes the learning curve.

(click to see the command-line options)
usage: code/30_regression.py [-h] --x REPR --y PROP [--splits SPLITS] [--eta ETA]
                             [--sigma SIGMA] [--kernel KERNEL]

  -h, --help       show this help message and exit
  --x REPR         path to the representations file
  --y PROP         path to the properties file
  --test TEST_SIZE test set fraction (default=0.2)
  --splits SPLITS  number of splits (default=5)
  --eta ETA        η hyperparameter (default=1e-5)
  --sigma SIGMA    σ hyperparameter (default=32.0)
  --kernel KERNEL  kernel type (G for Gaussian, L for Laplacian, myL for Laplacian for open-shell systems) (default L)

Example:

$ code/30_regression.py --x examples/X_lb.npy --y examples/dipole.dat --sigma 32.0 --eta 1e-10

{'repr': 'examples/X_lb.npy', 'prop': 'examples/dipole.dat', 'test_size': 0.2, 'splits': 5, 'eta': 1e-10, 'sigma': 32.0, 'kernel': 'L'}
1       4.508885e-01    1.963332e-01
2       3.743045e-01    2.243524e-01
4       3.685258e-01    2.894607e-01
6       2.480406e-01    1.777623e-01
8       1.939497e-01    6.768116e-15

computes the learning curve (σ = 32, η = 1e-10) and prints the number of training molecules, the prediction error, and its standard deviation
(your learning curve can be different because for the partial training set random shuffling is used).

31 [optional]. code/31_final_error.py prints the full-training prediction error for each molecule.

(click to see the command-line options)
usage: code/31_final_error.py [-h] --x REPR --y PROP [--eta ETA] [--sigma SIGMA]
                              [--kernel KERNEL]

  -h, --help       show this help message and exit
  --x REPR         path to the representations file
  --y PROP         path to the properties file
  --test TEST_SIZE test set fraction (default=0.2)
  --eta ETA        η hyperparameter (default=1e-5)
  --sigma SIGMA    σ hyperparameter (default=32.0)
  --kernel KERNEL  kernel type (G for Gaussian, L for Laplacian, myL for Laplacian for open-shell systems) (default L)

Example:

$ code/31_final_error.py --x examples/X_lb.npy --y examples/dipole.dat --sigma 32.0 --eta 1e-10

{'repr': 'examples/X_lb.npy', 'prop': 'examples/dipole.dat', 'eta': 1e-10, 'test_size': 0.2, 'sigma': 32.0, 'kernel': 'L'}
1.154644e-01
2.724350e-01

prints the final prediction error for the test set molecules (just two in this case).

About

Code to support the paper: A. Fabrizio, K. R. Briling, and C. Corminboeuf, “SPAHM: the Spectrum of Approximated Hamiltonian Matrices representations”, Digital Discovery, 2022, 1, 286–294

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Languages