In [1]:
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import rdkit
print(rdkit.__version__)

2022.09.1


Start by getting some molecules to work with:

In [3]:
ms = [x for x in Chem.SmilesMolSupplier('../data/BLSets_selected_actives.txt') if x.GetProp('_Name')=='CHEMBL204']
len(ms)

452

The idea behind the new code is that every supported fingerprint algorithm is used the same way: you create a generator for the algorithm with the appropriate parameters set, then ask that generator for the type of fingerprint you want for each molecule.
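As a rough sketch of what "used the same way" means in practice, the snippet below builds generators for four algorithms and calls the same method on each. The aspirin SMILES and the specific parameter values are illustrative assumptions, not part of the dataset used in this post.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Illustrative molecule (an assumption; any valid SMILES would do).
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')

# One generator per algorithm. The parameter values are example choices.
generators = {
    'Morgan': rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048),
    'RDKit': rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=7, fpSize=2048),
    'AtomPair': rdFingerprintGenerator.GetAtomPairGenerator(fpSize=2048),
    'TopologicalTorsion': rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=2048),
}

# The call that produces the fingerprint is identical for every generator.
for name, gen in generators.items():
    fp = gen.GetFingerprint(mol)
    print(f'{name}: {len(fp)} bits, {fp.GetNumOnBits()} set')
```

Only the construction step differs between algorithms; everything downstream of the generator can be written once.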

Let's look at how that works for Morgan fingerprints. When we create the generator we provide the radius and the size of the fingerprints to be generated:

In [5]:
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048)

The fingerprint generator knows how to create four separate types of fingerprints:
1. `fpgen.GetFingerprint(m)`: returns a bit vector of size `fpSize`
2. `fpgen.GetCountFingerprint(m)`: returns a count vector of size `fpSize`
3. `fpgen.GetSparseFingerprint(m)`: returns a sparse bit vector
4. `fpgen.GetSparseCountFingerprint(m)`: returns a sparse count vector

The sparse bit and count vectors are of variable size, depending on the fingerprint type, but are always very large (at least $2^{32}-1$).

Here's a demonstration of that:

In [3]:
fp = fpgen.GetFingerprint(ms[0])
cfp = fpgen.GetCountFingerprint(ms[0])
sfp = fpgen.GetSparseFingerprint(ms[0])
scfp = fpgen.GetSparseCountFingerprint(ms[0])

In [4]:
print(f'fp: {type(fp)} {len(fp)}')
print(f'cfp: {type(cfp)} {cfp.GetLength()}')
print(f'sfp: {type(sfp)} {len(sfp)}')
print(f'scfp: {type(scfp)} {scfp.GetLength()}')

fp: <class 'rdkit.DataStructs.cDataStructs.ExplicitBitVect'> 2048
cfp: <class 'rdkit.DataStructs.cDataStructs.UIntSparseIntVect'> 2048
sfp: <class 'rdkit.DataStructs.cDataStructs.SparseBitVect'> 4294967295
scfp: <class 'rdkit.DataStructs.cDataStructs.ULongSparseIntVect'> 18446744073709551615
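To get a feel for what the four objects contain, a minimal sketch like the following can help. It uses a single illustrative molecule (aspirin, an assumption made so the snippet runs without the data file) and inspects the set bits and nonzero counts of each fingerprint.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Illustrative molecule (an assumption; not taken from the dataset above).
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

fp = fpgen.GetFingerprint(mol)
cfp = fpgen.GetCountFingerprint(mol)
sfp = fpgen.GetSparseFingerprint(mol)
scfp = fpgen.GetSparseCountFingerprint(mol)

# Bit vectors: number of set bits and the indices of the first few.
print(f'fp:   {fp.GetNumOnBits()} bits set, first few: {list(fp.GetOnBits())[:5]}')
print(f'sfp:  {sfp.GetNumOnBits()} bits set')

# Count vectors: GetNonzeroElements() returns a dict of {index: count}.
counts = cfp.GetNonzeroElements()
sparse_counts = scfp.GetNonzeroElements()
print(f'cfp:  {len(counts)} nonzero indices, total count {sum(counts.values())}')
print(f'scfp: {len(sparse_counts)} nonzero indices, total count {sum(sparse_counts.values())}')
```

The fixed-size vectors are folded into `fpSize` positions, so different features can land on the same index; the sparse variants avoid this at the cost of much larger index values.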
