RDKit molecules, atoms, bonds, conformers, and reactions support an interface, which we call the property interface, for storing arbitrary data. It is used a lot internally, but it can also be very useful in other code. This post provides a quick overview of how properties work and what you can do with them.

In [1]:
from rdkit import Chem

import rdkit
print(rdkit.__version__)

2024.09.4


# Property basics

The properties are stored in a key:value data structure (similar to a dictionary in Python). The keys must be strings but the values can be various types.

One obvious use of properties is to store the additional data found in an SDF file on the molecule. Here's an example of that:

In [2]:
import gzip
with gzip.open('/scratch/Data/PubChem/Compound_004500001_005000000.sdf.gz') as inf:
    suppl = Chem.ForwardSDMolSupplier(inf)
    ms = [next(suppl) for x in range(10)]

We can get a list of the properties present:

In [3]:
m = ms[0]
list(m.GetPropNames())

['PUBCHEM_COMPOUND_CID',
 'PUBCHEM_COMPOUND_CANONICALIZED',
 'PUBCHEM_CACTVS_COMPLEXITY',
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
 'PUBCHEM_CACTVS_HBOND_DONOR',
 'PUBCHEM_CACTVS_ROTATABLE_BOND',
 'PUBCHEM_CACTVS_SUBSKEYS',
 'PUBCHEM_IUPAC_OPENEYE_NAME',
 'PUBCHEM_IUPAC_CAS_NAME',
 'PUBCHEM_IUPAC_NAME_MARKUP',
 'PUBCHEM_IUPAC_NAME',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME',
 'PUBCHEM_IUPAC_INCHI',
 'PUBCHEM_IUPAC_INCHIKEY',
 'PUBCHEM_XLOGP3_AA',
 'PUBCHEM_EXACT_MASS',
 'PUBCHEM_MOLECULAR_FORMULA',
 'PUBCHEM_MOLECULAR_WEIGHT',
 'PUBCHEM_OPENEYE_CAN_SMILES',
 'PUBCHEM_OPENEYE_ISO_SMILES',
 'PUBCHEM_CACTVS_TPSA',
 'PUBCHEM_MONOISOTOPIC_WEIGHT',
 'PUBCHEM_TOTAL_CHARGE',
 'PUBCHEM_HEAVY_ATOM_COUNT',
 'PUBCHEM_ATOM_DEF_STEREO_COUNT',
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
 'PUBCHEM_BOND_DEF_STEREO_COUNT',
 'PUBCHEM_BOND_UDEF_STEREO_COUNT',
 'PUBCHEM_ISOTOPIC_ATOM_COUNT',
 'PUBCHEM_COMPONENT_COUNT',
 'PUBCHEM_CACTVS_TAUTO_COUNT',
 'PUBCHEM_COORDINATE_TYPE',
 'PUBCHEM_BONDANNOT

And then retrieve the property values themselves with `GetProp()`:

In [4]:
m.GetProp('PUBCHEM_MOLECULAR_WEIGHT')

'516.3'

`GetProp()` returns the property values as strings, but we can also retrieve them as specific types by calling the getter for that type:

In [5]:
m.GetDoubleProp('PUBCHEM_MOLECULAR_WEIGHT')

516.3

In [6]:
m.GetIntProp('PUBCHEM_HEAVY_ATOM_COUNT')

31

The retrieval functions currently supported on molecules are:
- `GetProp()` -> string
- `GetDoubleProp()` -> floating point
- `GetIntProp()` -> integer
- `GetUnsignedProp()` -> unsigned integer
- `GetBoolProp()` -> boolean
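
As a small illustration of these getters, the sketch below sets a few properties on a molecule built from SMILES and reads each one back with the matching typed call. The property names (`label`, `mw`, and so on) are made up for this example, and the final conversion assumes the stored string parses cleanly as a number.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')

# Hypothetical property names, chosen only for this example.
m.SetProp('label', 'ethanol')
m.SetDoubleProp('mw', 46.07)
m.SetIntProp('heavy_atoms', 3)
m.SetUnsignedProp('n_rings', 0)
m.SetBoolProp('is_solvent', True)

# Read each value back with the getter that matches its type.
label = m.GetProp('label')
mw = m.GetDoubleProp('mw')
heavy_atoms = m.GetIntProp('heavy_atoms')
n_rings = m.GetUnsignedProp('n_rings')
is_solvent = m.GetBoolProp('is_solvent')

# Values stored as strings (as they are after reading an SDF)
# can still be requested as numbers.
m.SetProp('mw_text', '516.3')
mw_from_text = m.GetDoubleProp('mw_text')
```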



It's possible to retrieve all of the properties, with the correct types, in one call:

In [7]:
m.GetPropsAsDict()

{'PUBCHEM_COMPOUND_CID': 4500001,
 'PUBCHEM_COMPOUND_CANONICALIZED': 1,
 'PUBCHEM_CACTVS_COMPLEXITY': 626,
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR': 4,
 'PUBCHEM_CACTVS_HBOND_DONOR': 1,
 'PUBCHEM_CACTVS_ROTATABLE_BOND': 7,
 'PUBCHEM_CACTVS_SUBSKEYS': 'AAADceB7oABHAAAAAAAAAAAAGAAAAWAAAAAwYAAAAAAAAAAB0AAAHgYYAAAADQrF2ySz0IfMEAiqAidydACS0AthB7AdykA4ZoiIKCLBm5HEIAhgnALIyAcQgMAOhABQAAKAABQIAKAABQAAKAAAAAAAAA==',
 'PUBCHEM_IUPAC_OPENEYE_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_CAS_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]thio]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_NAME_MARKUP': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-<I>N</I>-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME
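
One place where `GetPropsAsDict()` comes in handy is collecting the properties of several molecules into rows that can later be handed to a table library. Here is a minimal sketch of that idea; the SMILES and the property names `heavy_atoms` and `source` are invented for the example.

```python
from rdkit import Chem

rows = []
for smi in ['CCO', 'c1ccccc1']:
    mol = Chem.MolFromSmiles(smi)
    # Hypothetical properties, set here only so there is something to collect.
    mol.SetIntProp('heavy_atoms', mol.GetNumHeavyAtoms())
    mol.SetProp('source', 'example')

    # Start each row with the SMILES, then add every public property.
    row = {'smiles': smi}
    row.update(mol.GetPropsAsDict())
    rows.append(row)

for row in rows:
    print(row)
```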

You can check whether or not a property is there:

In [8]:
m.HasProp('foo')

0

Asking for a property that's not present throws an exception:

In [9]:
m.GetProp('foo')

KeyError: 'foo'
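
If you would rather get a default value than an exception, you can check with `HasProp()` first or catch the `KeyError`. The helper `get_prop_or_default` below is not part of the RDKit; it is just a small convenience function written for this example.

```python
from rdkit import Chem


def get_prop_or_default(obj, name, default=None):
    """Return the string value of property `name`, or `default` if it is absent."""
    if obj.HasProp(name):
        return obj.GetProp(name)
    return default


m = Chem.MolFromSmiles('CCO')
m.SetProp('source', 'example')

present = get_prop_or_default(m, 'source')
missing = get_prop_or_default(m, 'foo', default='n/a')

# The same thing using exception handling instead of HasProp().
try:
    m.GetProp('foo')
    raised = False
except KeyError:
    raised = True
```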

And you can remove properties:

In [10]:
m.ClearProp('PUBCHEM_HEAVY_ATOM_COUNT')
m.HasProp('PUBCHEM_HEAVY_ATOM_COUNT')

0

# Special property types

Properties whose names start with an underscore (`_`) are considered to be private, and any property can be marked as computed. Neither private nor computed properties are returned by default by calls to `GetPropNames()` or `GetPropsAsDict()` for molecules.

One frequently used private property is `_Name`, which is read from the header of mol files:

In [11]:
m.GetProp('_Name')

'4500001'

You can see the full list of property names by passing the `includePrivate` and `includeComputed` flags to `GetPropNames()` or `GetPropsAsDict()`:

In [12]:
list(m.GetPropNames(includePrivate=True, includeComputed=True))

['__computedProps',
 '_Name',
 '_MolFileInfo',
 '_MolFileComments',
 '_MolFileChiralFlag',
 'numArom',
 '_StereochemDone',
 'PUBCHEM_COMPOUND_CID',
 'PUBCHEM_COMPOUND_CANONICALIZED',
 'PUBCHEM_CACTVS_COMPLEXITY',
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
 'PUBCHEM_CACTVS_HBOND_DONOR',
 'PUBCHEM_CACTVS_ROTATABLE_BOND',
 'PUBCHEM_CACTVS_SUBSKEYS',
 'PUBCHEM_IUPAC_OPENEYE_NAME',
 'PUBCHEM_IUPAC_CAS_NAME',
 'PUBCHEM_IUPAC_NAME_MARKUP',
 'PUBCHEM_IUPAC_NAME',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME',
 'PUBCHEM_IUPAC_INCHI',
 'PUBCHEM_IUPAC_INCHIKEY',
 'PUBCHEM_XLOGP3_AA',
 'PUBCHEM_EXACT_MASS',
 'PUBCHEM_MOLECULAR_FORMULA',
 'PUBCHEM_MOLECULAR_WEIGHT',
 'PUBCHEM_OPENEYE_CAN_SMILES',
 'PUBCHEM_OPENEYE_ISO_SMILES',
 'PUBCHEM_CACTVS_TPSA',
 'PUBCHEM_MONOISOTOPIC_WEIGHT',
 'PUBCHEM_TOTAL_CHARGE',
 'PUBCHEM_ATOM_DEF_STEREO_COUNT',
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
 'PUBCHEM_BOND_DEF_STEREO_COUNT',
 'PUBCHEM_BOND_UDEF_STEREO_COUNT',
 'PUBCHEM_ISOTOPIC_ATOM_COUNT',
 'PUBCHEM_CO

# Adding your own properties

I'm demonstrating this for molecules, but the same thing works for the other types.

In [13]:
m = Chem.MolFromSmiles('CCC')
m.SetProp('prop1','val1')
m.SetIntProp('prop2',2)
m.SetDoubleProp('prop3',3.14159)

m.GetPropsAsDict()

{'prop1': 'val1', 'prop2': 2, 'prop3': 3.14159}

In [14]:
m.SetProp('computed1','val', computed=True)
m.SetProp('_private1','val', computed=False)

m.GetPropsAsDict()

{'prop1': 'val1', 'prop2': 2, 'prop3': 3.14159}

In [15]:
m.GetPropsAsDict(includeComputed=True)

{'numArom': 0,
 'prop1': 'val1',
 'prop2': 2,
 'prop3': 3.14159,
 'computed1': 'val'}

In [16]:
m.GetPropsAsDict(includePrivate=True,includeComputed=True)

{'__computedProps': <rdkit.rdBase._vectNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE at 0x7b88e3f5f040>,
 'numArom': 0,
 '_StereochemDone': 1,
 'prop1': 'val1',
 'prop2': 2,
 'prop3': 3.14159,
 'computed1': 'val',
 '_private1': 'val'}
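
The same setters and getters are available on atoms and bonds. The sketch below shows this; the property names `role`, `partial_charge`, and `rank` and their values are arbitrary choices for illustration.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')

# Properties on an atom (index 2 is the oxygen in this SMILES).
atom = m.GetAtomWithIdx(2)
atom.SetProp('role', 'hydroxyl oxygen')
atom.SetDoubleProp('partial_charge', -0.4)

# Properties on a bond.
bond = m.GetBondWithIdx(0)
bond.SetIntProp('rank', 1)

# The values live on the molecule's atoms and bonds, so they are
# still there when we fetch the atom or bond again.
role = m.GetAtomWithIdx(2).GetProp('role')
charge = m.GetAtomWithIdx(2).GetDoubleProp('partial_charge')
rank = m.GetBondWithIdx(0).GetIntProp('rank')
other_atom_has_role = bool(m.GetAtomWithIdx(0).HasProp('role'))
```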

# Properties and copying/serialization/pickling

In [17]:
m = Chem.MolFromSmiles('CC')
m.SetProp('prop1','v1')
m.SetProp('computed1','v2')
m.GetAtomWithIdx(0).SetIntProp('aprop',1)

Properties are copied when molecules are copied, either using RDKit's recommended approach:

In [18]:
m2 = Chem.Mol(m)
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))


mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']


Or using the `copy` module:

In [19]:
import copy
m2 = copy.deepcopy(m)
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))


mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']


Properties are not, by default, captured when molecules are serialized (converted to binary):

In [20]:
m2 = Chem.Mol(m.ToBinary())
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))

mol: []
atom: []


But you can change this:

In [21]:
m2 = Chem.Mol(m.ToBinary(Chem.PropertyPickleOptions.AllProps))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))


mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
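
`AllProps` is not the only choice: `PropertyPickleOptions` also has narrower flags such as `MolProps` and `AtomProps`. The following sketch assumes these flags can be combined with `|` and passed to `ToBinary()` as an integer bit mask.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CC')
m.SetProp('prop1', 'v1')
m.GetAtomWithIdx(0).SetIntProp('aprop', 1)

# Keep only the molecule-level properties.
mol_only = Chem.Mol(m.ToBinary(Chem.PropertyPickleOptions.MolProps))

# Keep molecule-level and atom-level properties by combining the flags.
flags = Chem.PropertyPickleOptions.MolProps | Chem.PropertyPickleOptions.AtomProps
mol_and_atoms = Chem.Mol(m.ToBinary(flags))

print('mol only, atom has aprop:', bool(mol_only.GetAtomWithIdx(0).HasProp('aprop')))
print('mol + atoms, atom has aprop:', bool(mol_and_atoms.GetAtomWithIdx(0).HasProp('aprop')))
```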


And finally, pickling a molecule with Python's `pickle` module does not serialize properties by default:

In [22]:
import pickle

m2 = pickle.loads(pickle.dumps(m))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))


mol: []
atom: []


But you can change this by setting a global default:

In [23]:
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)

m2 = pickle.loads(pickle.dumps(m))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))

mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
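
Because this setting is global, it affects every pickle operation that follows. If you only want it for a particular piece of code, one option is to remember the old value with `Chem.GetDefaultPickleProperties()` and restore it afterwards. The context manager `pickle_properties` below is a hypothetical helper written for this post, not something the RDKit provides.

```python
import pickle
from contextlib import contextmanager

from rdkit import Chem


@contextmanager
def pickle_properties(options):
    """Temporarily change the default pickle properties, then restore them."""
    previous = Chem.GetDefaultPickleProperties()
    Chem.SetDefaultPickleProperties(options)
    try:
        yield
    finally:
        Chem.SetDefaultPickleProperties(previous)


m = Chem.MolFromSmiles('CC')
m.SetProp('prop1', 'v1')

before = Chem.GetDefaultPickleProperties()
with pickle_properties(Chem.PropertyPickleOptions.AllProps):
    m2 = pickle.loads(pickle.dumps(m))
after = Chem.GetDefaultPickleProperties()

print('prop1 after round trip:', m2.GetProp('prop1'))
print('default restored:', before == after)
```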


# Writing properties

Both the `SDWriter` and the `SmilesWriter` can write properties.

In [24]:
from io import StringIO
m = Chem.MolFromSmiles('CC')
m.SetProp('prop1','v1')
m.SetProp('computed1','v2')


The `SDWriter` will by default write all non-private properties (including computed properties):

In [25]:
sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.write(m)
print(sio.getvalue())


     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <prop1>  (1) 
v1

>  <computed1>  (1) 
v2

$$$$



But you can control which properties are written:

In [26]:
sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.SetProps(['prop1'])
    w.write(m)
print(sio.getvalue())


     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <prop1>  (1) 
v1

$$$$
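
To check that properties survive a trip through the SD format, you can read the text straight back in. This sketch uses `SDMolSupplier.SetData()` to parse the string; the property `count` is invented for the example. Note that values come back from the file as strings, so the typed getter does the conversion.

```python
from io import StringIO

from rdkit import Chem

m = Chem.MolFromSmiles('CC')
m.SetProp('prop1', 'v1')
m.SetIntProp('count', 2)  # hypothetical property for this example

# Write the molecule and its properties to an in-memory SD record.
sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.write(m)
sdf_text = sio.getvalue()

# Read the record back from the string.
suppl = Chem.SDMolSupplier()
suppl.SetData(sdf_text)
m2 = suppl[0]

prop1 = m2.GetProp('prop1')
count = m2.GetIntProp('count')
print(prop1, count)
```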



The `SmilesWriter` doesn't write properties by default, but we can tell it to:

In [27]:
sio = StringIO()
with Chem.SmilesWriter(sio) as w:
    w.SetProps(m.GetPropNames())
    w.write(m)
print(sio.getvalue())

SMILES Name prop1 computed1
CC 0 v1 v2

