Datasets for open forcefield parameterization and development

General protocol for filtering ThermoML data for pure solvent data

ThermoML data were compiled and filtered using the ThermoPyL tool developed by the Chodera Lab @ MSKCC.


  1. Pull full ThermoML archive

  2. Discard known erroneous data (j.fluid.2013.12.014 is the only such source I know of so far)

  3. Define properties of interest to pass filter

  4. Allow only C, O and H atoms to pass

  5. Generate SMILES formulae from component names (NIH CirPy module)

  6. Apply filter for "=" and "#" to SMILES strings (removes compounds with double or triple bonds)

  7. Generate CAS from component names (CirPy)

  8. Apply temperature and pressure filters (250 K - 400 K and 1 atm - 1000 atm)

  9. Keep only liquid phase data points

  10. Separate final large dataframe into subframes by property of interest

      a. Remove data with no associated uncertainties from subframes

  11. Generate counts by component and journal article for all dataframes

  12. Save each dataframe as a separate .csv text file
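
The filtering steps above (4, 6, 8, 9 and 10) can be sketched with plain pandas. This is only an illustration, not the ThermoPyL code itself: the column names (`formula`, `smiles`, `temperature_K`, `pressure_kPa`, `phase`, `component`), the `<property>_std` naming for uncertainties, and the helper names are all assumptions made for this example.

```python
import re

import pandas as pd

ALLOWED_ELEMENTS = {"C", "H", "O"}
ATM_IN_KPA = 101.325  # 1 atm expressed in kPa


def only_allowed_elements(formula: str) -> bool:
    """True if a formula such as 'C2H6O' contains only C, H and O."""
    elements = re.findall(r"[A-Z][a-z]?", formula)
    return bool(elements) and set(elements) <= ALLOWED_ELEMENTS


def has_multiple_bonds(smiles: str) -> bool:
    """True if the SMILES string contains an explicit double or triple bond."""
    return "=" in smiles or "#" in smiles


def filter_pure_solvent(df: pd.DataFrame, properties: list) -> dict:
    """Return one filtered sub-frame per property of interest.

    Assumes hypothetical columns: formula, smiles, temperature_K,
    pressure_kPa, phase, plus '<prop>' and '<prop>_std' per property.
    """
    mask = (
        df["formula"].map(only_allowed_elements)          # step 4
        & ~df["smiles"].map(has_multiple_bonds)           # step 6
        & df["temperature_K"].between(250.0, 400.0)       # step 8
        & df["pressure_kPa"].between(1.0 * ATM_IN_KPA, 1000.0 * ATM_IN_KPA)
        & (df["phase"] == "Liquid")                       # step 9
    )
    kept = df[mask]
    subframes = {}
    for prop in properties:                               # step 10
        # Step 10a: drop rows lacking the value or its uncertainty.
        subframes[prop] = kept.dropna(subset=[prop, prop + "_std"])
    return subframes


def counts_by(sub: pd.DataFrame, column: str) -> pd.Series:
    """Number of data points per distinct value of `column` (step 11)."""
    return sub.groupby(column).size()
```

Each sub-frame returned by `filter_pure_solvent` could then be written out with `DataFrame.to_csv`. One caveat of the string check in step 6: a SMILES written in lowercase aromatic form contains neither "=" nor "#", so it would pass this filter unless the strings are kekulized first.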

Energetic data for model systems

Toy dataset for alkanes, alcohols, and ethers

Christopher I. Bayly developed a toy dataset of potential molecules of interest, which is deposited in the "AlkEthOH_distrib" subdirectory of the "Model Systems" directory. Construction of this set is described in the README.txt there, which should be converted to Markdown.