
Obtain random PubChem molecules and post-process them if necessary.


marcelmbn/GetRandomPCMol



GetRandomPCMol

Python program for obtaining random PubChem molecules and, optionally, generating conformer ensembles for them.

Dependencies

getrandompcmol in its current state depends on existing installations of

Installation

After cloning the code via git clone git@github.com:grimme-lab/GetRandomPCMol.git, a new conda environment with the required Python prerequisites can be set up with:

conda env create -f environment.yml
conda activate getrandompcmol

getrandompcmol can be installed into this environment with

pip install -e .

The flag -e allows modification of the code without the need to reinstall.

Use

After installation, the package can generate a data set of random molecules. The randomization is based on a random number generator in NumPy.

getrandompcmol -n 500 --opt --crest protonate --maxnumat 35 --maxcid 1000000 --seed 27051997 --evalconf 5 10

The distinct keywords are described as follows:

  • -n/--n <int>: Number of molecules to generate.
  • --maxnumat <int>: Maximum number of atoms allowed per molecule.
  • --maxcid <int>: Upper bound of the range (1 to <maxcid>) from which random CIDs are drawn.
  • --opt: Optimize the molecules using GFN2-xTB in xtb.
  • --crest {normal,protonate}: Generate conformer ensembles from the given structures via crest. If protonate is used instead of normal, every structure is protonated by crest before the actual conformer search.
  • --seed <int>: Starting seed for random number generation.
  • --evalconf <int> <int>: Range of conformer ensemble sizes allowed for post-processing.
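
To illustrate how --seed and --maxcid interact, the following is a minimal sketch of a seeded CID draw with NumPy. It is not taken from the getrandompcmol source: the function name draw_random_cids is hypothetical, and the use of numpy.random.default_rng is an assumption, since the README only states that a NumPy random number generator is used.

```python
from typing import List

import numpy as np


def draw_random_cids(n: int, maxcid: int, seed: int) -> List[int]:
    """Draw n candidate CIDs uniformly from 1..maxcid (inclusive).

    Hypothetical helper for illustration; the package's own sampling
    routine may differ.
    """
    # A seeded generator makes the sequence of draws reproducible.
    rng = np.random.default_rng(seed)
    # `integers` excludes the upper bound by default, hence maxcid + 1.
    return rng.integers(1, maxcid + 1, size=n).tolist()


if __name__ == "__main__":
    # Same seed and range as in the example command above.
    print(draw_random_cids(n=5, maxcid=1_000_000, seed=27051997))
```

Running the sketch twice with the same seed yields the same list. Candidates drawn this way can repeat, and a given CID may not resolve to a usable structure or may exceed --maxnumat, so a real run presumably has to draw more candidates than the requested number of molecules.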

If only the evaluation and post-processing of a previously generated conformer ensemble is desired, use --evalconfonly. Further information on possible input flags is available via --help.

CIDs and names of the generated molecules are saved to a file compounds.txt. Additionally, if --crest is active, each molecule directory contains a file conformer.json, in which relevant properties of the generated ensemble are saved.

The crest conformer ensemble generation is executed in parallel. The parallelization adapts to the number of available cores on your machine and the requested number of molecules.
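
The idea of adapting the parallelization to both the core count and the number of molecules can be sketched as follows. This is an assumption-laden illustration, not the package's implementation: worker_count and run_all are hypothetical names, and the task callable stands in for whatever launches a crest run.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def worker_count(n_molecules: int, available_cores: Optional[int] = None) -> int:
    """Never use more workers than there are cores or molecules (at least 1)."""
    cores = available_cores or os.cpu_count() or 1
    return max(1, min(cores, n_molecules))


def run_all(
    molecules: Sequence[str],
    task: Callable[[str], str],
    available_cores: Optional[int] = None,
) -> List[str]:
    """Apply `task` to every molecule in parallel, preserving input order."""
    workers = worker_count(len(molecules), available_cores)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, molecules))


if __name__ == "__main__":
    # Placeholder task; a real task would start an external crest process.
    print(run_all(["mol_a", "mol_b", "mol_c"], str.upper, available_cores=8))
```

Threads are used here because each task would mostly wait on an external process; how the cores are actually divided between concurrent crest jobs in getrandompcmol is not specified in this README.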

Source code

The source code is in the src/getrandompcmol directory, which also contains some dunder files:

Setup files and Packaging

Packaging is done with setuptools, which is configured through the pyproject.toml and/or setup.cfg/setup.py files.

pyproject.toml vs. setup.cfg vs. setup.py

The setup.py file is a Python script, and configuration is passed through keyword arguments of setuptools.setup(). This is not recommended due to possible security and parsing issues. The same setup can be accomplished in a declarative style within setup.cfg, in which case setup.py remains mostly empty and only calls setuptools.setup(). The pyproject.toml file aims to unify configuration files, including those of various tools like black or pytest. For packaging, it is very similar to setup.cfg. However, pyproject.toml has not been adopted as the default yet, and many projects still use setup.cfg to declare the packaging setup. Note that setup.py is not necessary if a pyproject.toml is present.

pyproject.toml

  • minimal build specification to use with setuptools
  • configuration of other tools (black, pytest, mypy, ...)

setup.cfg


The package can be installed with pip install . or with something like pip install .[dev] to also install the additional dependencies specified in setup.cfg's options.extras_require. Pass the -e flag for editable mode, which loads the package from the source directory, i.e., changing the source code does not require a new installation.
