-
Notifications
You must be signed in to change notification settings - Fork 32
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
5 changed files
with
373 additions
and
1 deletion.
There are no files selected for viewing
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
Fingerprint Comparison | ||
====================== | ||
|
||
The `e3fp.fingerprint.metrics` sub-package provides several useful methods for | ||
batch comparison of fingerprints in various representations. | ||
|
||
Fingerprint Metrics | ||
------------------- | ||
|
||
These metrics operate directly on pairs of :py:class:`.Fingerprint` and | ||
:py:class:`.FingerprintDatabase` objects or on a combination of each. If | ||
only a single variable is specified, self-comparison is performed. The | ||
implemented methods are common functions for fingerprint similarity in the | ||
literature. | ||
|
||
.. todo:: | ||
|
||
Document examples | ||
|
||
Array Metrics | ||
------------- | ||
|
||
To efficiently compare fingerprint databases above, we provide comparison | ||
metrics that can operate directly on the internal sparse matrix representation | ||
without the need to "densify it". We describe these here, as they have several | ||
additional features. | ||
|
||
The array metrics implemented in `e3fp.fingerprint.metrics.array_metrics` are | ||
implemented such that they may take any combination of dense and sparse inputs. | ||
Additionally, they are designed to function as | ||
`scikit-learn-compatible kernels <http://scikit-learn.org/stable/modules/metrics.html>`_ | ||
for machine learning tasks. For example, one might perform an analysis using a | ||
support vector machine (SVM) and Tanimoto kernel. | ||
|
||
.. code:: python | ||
>>> from sklearn.svm import SVC | ||
>>> from e3fp.fingerprint.metrics.array_metrics import tanimoto | ||
>>> clf = SVC(kernel=tanimoto) | ||
>>> clf.fit(X, y) | ||
... | ||
>>> clf.predict(test) | ||
... | ||
Most common fingerprint comparison metrics only apply to binary fingerprints. | ||
We include several that operate equally well on count- and float-based | ||
fingerprints. For example, to our knowledge, we provide the only open source | ||
implementation of Soergel similarity, the analog to the Tanimoto coefficient | ||
for non-binary fingerprints that can efficiently operate on sparse inputs. | ||
|
||
.. code:: python | ||
>>> from e3fp.fingerprint.metrics.array_metrics import soergel | ||
>>> clf = SVC(kernel=soergel) | ||
>>> clf.fit(X, y) | ||
... | ||
>>> clf.predict(test) | ||
... |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,195 @@ | ||
Fingerprints | ||
============ | ||
|
||
The simplest interface for molecular fingerprints are through three classes in | ||
`e3fp.fingerprint.fprint`: | ||
|
||
:py:class:`.Fingerprint` | ||
a fingerprint with "on" bits | ||
|
||
:py:class:`.CountFingerprint` | ||
a fingerprint with counts for each "on" bit | ||
|
||
:py:class:`.FloatFingerprint` | ||
a fingerprint with float values for each "on" bit, generated for example by | ||
averaging conformer fingerprints. | ||
|
||
In addition to storing "on" indices and, for the latter two, corresponding | ||
values, they store fingerprint properties, such as name, level, and any | ||
arbitrary property. They also provide simple interfaces for fingerprint | ||
comparison, some basic processing, and comparison. | ||
|
||
.. note:: Many of these operations are more efficient when operating on a | ||
:py:class:`.FingerprintDatabase`. See :ref:`Fingerprint Storage` for more | ||
information. | ||
|
||
In the below examples, we will focus on :py:class:`.Fingerprint` and | ||
:py:class:`.CountFingerprint`. First, we execute the necessary imports. | ||
|
||
.. testsetup:: | ||
|
||
import numpy as np | ||
np.random.seed(0) | ||
|
||
.. doctest:: | ||
|
||
>>> from e3fp.fingerprint.fprint import Fingerprint, CountFingerprint | ||
>>> import numpy as np | ||
|
||
.. seealso:: | ||
|
||
:ref:`Fingerprint Storage`, :ref:`Fingerprint Comparison` | ||
|
||
Creation and Conversion | ||
----------------------- | ||
|
||
Here we create a bit-fingerprint with random "on" indices. | ||
|
||
>>> bits = 2**32 | ||
>>> indices = np.sort(np.random.randint(0, bits, 30)) | ||
>>> indices | ||
array([ 243580376, 305097549, ..., 3975407269, 4138900056]) | ||
>>> fp1 = Fingerprint(indices, bits=bits, level=0) | ||
>>> fp1 | ||
Fingerprint(indices=array([243580376, ..., 4138900056]), level=0, bits=4294967296, name=None) | ||
|
||
This fingerprint is extremely sparse | ||
|
||
>>> fp1.bit_count | ||
30 | ||
>>> fp1.density | ||
6.984919309616089e-09 | ||
|
||
We can therefore "fold" the fingerprint through a series of bitwise "OR" | ||
operations on halves of the sparse vector until it is of a specified length, | ||
with minimal collision of bits. | ||
|
||
>>> fp_folded = fp1.fold(1024) | ||
>>> fp_folded | ||
Fingerprint(indices=array([9, 70, ..., 845, 849]), level=0, bits=1024, name=None) | ||
>>> fp_folded.bit_count | ||
29 | ||
>>> fp_folded.density | ||
0.0283203125 | ||
|
||
A :py:class:`.CountFingerprint` may be created by also providing a dictionary | ||
matching indices with nonzero counts to the counts. | ||
|
||
>>> indices2 = np.sort(np.random.randint(0, bits, 60)) | ||
>>> counts = dict(zip(indices2, np.random.randint(1, 10, indices2.size))) | ||
>>> counts | ||
{80701568: 8, 580757632: 7, ..., 800291326: 5, 4057322111: 7} | ||
>>> cfp1 = CountFingerprint(counts=counts, bits=bits, level=0) | ||
>>> cfp1 | ||
CountFingerprint(counts={80701568: 8, 580757632: 7, ..., 3342157822: 2, 4057322111: 7}, level=0, bits=4294967296, name=None) | ||
|
||
Unlike folding a bit fingerprint, by default, folding a count fingerprint | ||
performs a "SUM" operation on colliding counts. | ||
|
||
>>> cfp1.bit_count | ||
60 | ||
>>> cfp_folded = cfp1.fold(1024) | ||
>>> cfp_folded | ||
CountFingerprint(counts={128: 15, 257: 4, ..., 1022: 2, 639: 7}, level=0, bits=1024, name=None) | ||
>>> cfp_folded.bit_count | ||
57 | ||
|
||
It is trivial to interconvert the fingerprints. | ||
|
||
>>> cfp_folded2 = CountFingerprint.from_fingerprint(fp_folded) | ||
>>> cfp_folded2 | ||
CountFingerprint(counts={9: 1, 87: 1, ..., 629: 1, 763: 1}, level=0, bits=1024, name=None) | ||
>>> cfp_folded2.indices[:5] | ||
array([ 9, 70, 72, 87, 174]) | ||
>>> fp_folded.indices[:5] | ||
array([ 9, 70, 72, 87, 174]) | ||
|
||
RDKit Morgan fingerprints (analogous to ECFP) may easily be converted to a | ||
:py:class:`.Fingerprint`. | ||
|
||
>>> from rdkit import Chem | ||
>>> from rdkit.Chem import AllChem | ||
>>> mol = Chem.MolFromSmiles('Cc1ccccc1') | ||
>>> mfp = AllChem.GetMorganFingerprintAsBitVect(mol, 2) | ||
>>> mfp | ||
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...> | ||
>>> Fingerprint.from_rdkit(mfp) | ||
Fingerprint(indices=array([389, 1055, ..., 1873, 1920]), level=-1, bits=2048, name=None) | ||
|
||
Likewise, :py:class:`.Fingerprint` can be easily converted to a NumPy ndarray or | ||
SciPy sparse matrix. | ||
|
||
>>> fp_folded.to_vector() | ||
<1x1024 sparse matrix of type '<type 'numpy.bool_'>' | ||
...with 29 stored elements in Compressed Sparse Row format> | ||
>>> fp_folded.to_vector(sparse=False) | ||
array([False, False, False, ..., False, False, False], dtype=bool) | ||
>>> np.where(fp_folded.to_vector(sparse=False))[0] | ||
array([ 9, 70, 72, 87, ...]) | ||
>>> cfp_folded.to_vector(sparse=False) | ||
array([0, 0, 0, ..., 0, 2, 0], dtype=uint16) | ||
>>> cfp_folded.to_vector(sparse=False).sum() | ||
252 | ||
|
||
Algebra | ||
------- | ||
|
||
Basic algebraic functions may be performed on fingerprints. If either | ||
fingerprint is a bit fingerprint, all algebraic functions are bit-wise. | ||
The following bit-wise operations are supported: | ||
|
||
Equality | ||
>>> fp1 = Fingerprint([0, 1, 6, 8, 12], bits=16) | ||
>>> fp2 = Fingerprint([1, 2, 4, 8, 11, 12], bits=16) | ||
>>> fp1 == fp2 | ||
False | ||
>>> fp1_copy = Fingerprint.from_fingerprint(fp1) | ||
>>> fp1 == fp1_copy | ||
True | ||
>>> fp1_copy.level = 5 | ||
>>> fp1 == fp1_copy | ||
False | ||
|
||
Union/OR | ||
>>> fp1 + fp2 | ||
Fingerprint(indices=array([0, 1, 2, 4, 6, 8, 11, 12]), level=-1, bits=16, name=None) | ||
>>> fp1 | fp2 | ||
Fingerprint(indices=array([0, 1, 2, 4, 6, 8, 11, 12]), level=-1, bits=16, name=None) | ||
|
||
Intersection/AND | ||
>>> fp1 & fp2 | ||
Fingerprint(indices=array([1, 8, 12]), level=-1, bits=16, name=None) | ||
|
||
Difference/AND NOT | ||
>>> fp1 - fp2 | ||
Fingerprint(indices=array([0, 6]), level=-1, bits=16, name=None) | ||
>>> fp2 - fp1 | ||
Fingerprint(indices=array([2, 4, 11]), level=-1, bits=16, name=None) | ||
|
||
XOR | ||
>>> fp1 ^ fp2 | ||
Fingerprint(indices=array([0, 2, 4, 6, 11]), level=-1, bits=16, name=None) | ||
|
||
With count or float fingerprints, bit-wise operations are still possible, but | ||
algebraic operations are applied to counts. | ||
|
||
>>> fp1 = CountFingerprint(counts={0: 3, 1: 2, 5: 1, 9: 3}, bits=16) | ||
>>> fp2 = CountFingerprint(counts={1: 2, 5: 2, 7: 3, 10: 7}, bits=16) | ||
>>> fp1 + fp2 | ||
CountFingerprint(counts={0: 3, 1: 4, 5: 3, 7: 3, 9: 3, 10: 7}, level=-1, bits=16, name=None) | ||
>>> fp1 - fp2 | ||
CountFingerprint(counts={0: 3, 1: 0, 5: -1, 7: -3, 9: 3, 10: -7}, level=-1, bits=16, name=None) | ||
>>> fp1 * 3 | ||
CountFingerprint(counts={0: 9, 1: 6, 5: 3, 9: 9}, level=-1, bits=16, name=None) | ||
>>> fp1 / 2 | ||
FloatFingerprint(counts={0: 1.5, 1: 1.0, 5: 0.5, 9: 1.5}, level=-1, bits=16, name=None) | ||
|
||
Finally, fingerprints may be batch added and averaged, producing either a count | ||
or float fingerprint when sensible. | ||
|
||
>>> from e3fp.fingerprint.fprint import add, mean | ||
>>> fps = [Fingerprint(np.random.randint(0, 32, 8), bits=32) for i in range(100)] | ||
>>> add(fps) | ||
CountFingerprint(counts={0: 23, 1: 23, ..., 30: 20, 31: 14}, level=-1, bits=32, name=None) | ||
>>> mean(fps) | ||
FloatFingerprint(counts={0: 0.23, 1: 0.23, ..., 30: 0.2, 31: 0.14}, level=-1, bits=32, name=None) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
Using Fingerprints | ||
================== | ||
|
||
While molecular fingerprints are widely used, few packages provide simple | ||
interfaces for working with them and interfacing with machine learning | ||
packages. E3FP provides a number of general utility classes and methods for | ||
doing precisely this. | ||
|
||
.. toctree:: | ||
fprints | ||
storage | ||
comparison |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,109 @@ | ||
Fingerprint Storage | ||
=================== | ||
=================== | ||
|
||
The most efficient way to store and interact with fingerprints is through the | ||
`e3fp.fingerprint.db.FingerprintDatabase` class. This class wraps a matrix with | ||
sparse rows (`scipy.sparse.csr_matrix`), where each row is a fingerprint. This | ||
enables rapid I/O of the database while also minimizing the memory footprint. | ||
Accessing the underlying sparse representation with the | ||
:ref:`.FingerprintDatabase.array` attribute is convenient for machine learning | ||
purposes, while the database class itself provides several useful functions. | ||
|
||
.. note:: | ||
|
||
We strongly recommend upgrading to at least SciPy v1.0.0 when working with | ||
large fingerprint databases, as old versions are much slower and have | ||
several bugs for database loading. | ||
|
||
|
||
Database I/O and Indexing | ||
------------------------- | ||
|
||
See the full `e3fp.fingerprint.db.FingerprintDatabase` documentation for a | ||
description of basic database usage, attributes, and methods. Below, several | ||
additional use cases are documented. | ||
|
||
Batch Database Operations | ||
------------------------- | ||
|
||
Due to the sparse representation of the underlying data structure, an un- | ||
folded database, a database with unfolded fingerprints does not use | ||
significantly more disk space than a database with folded fingerprints. However, | ||
it is usually necessary to fold fingerprints for machine learning tasks. The | ||
:py:class:`.FingerprintDatabase` does this very quickly. | ||
|
||
.. testsetup:: | ||
|
||
import numpy as np | ||
np.random.seed(3) | ||
|
||
.. doctest:: | ||
|
||
>>> from e3fp.fingerprint.db import FingerprintDatabase | ||
>>> from e3fp.fingerprint.fprint import Fingerprint | ||
>>> import numpy as np | ||
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB") | ||
>>> print(db) | ||
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0] | ||
>>> on_inds = [np.random.uniform(0, 2**32, size=30) for i in range(5)] | ||
>>> fps = [Fingerprint(x, bits=2**32) for x in on_inds] | ||
>>> db.add_fingerprints(fps) | ||
>>> print(db) | ||
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5] | ||
>>> db.get_density() | ||
6.984919309616089e-09 | ||
>>> fold_db = db.fold(1024) | ||
>>> print(fold_db) | ||
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 5] | ||
>>> fold_db.get_density() | ||
0.0287109375 | ||
|
||
A database can be converted to a different fingerprint type: | ||
|
||
>>> from e3fp.fingerprint.fprint import CountFingerprint | ||
>>> count_db = db.as_type(CountFingerprint) | ||
>>> print(count_db) | ||
FingerprintDatabase[name: TestDB, fp_type: CountFingerprint, level: -1, bits: 4294967296, fp_num: 5] | ||
>>> count_db[0] | ||
CountFingerprint(counts={2977004690: 1, ..., 3041471738: 1}, level=-1, bits=4294967296, name=None) | ||
|
||
The `e3fp.fingerprint.db.concat` method allows efficient joining of multiple | ||
databases. | ||
|
||
>>> from e3fp.fingerprint.db import concat | ||
>>> dbs = [] | ||
>>> for i in range(10): | ||
... db = FingerprintDatabase(fp_type=Fingerprint) | ||
... on_inds = [np.random.uniform(0, 1024, size=30) for j in range(5)] | ||
... fps = [Fingerprint(x, bits=2**32, name="Mol{}".format(i)) for x in on_inds] | ||
... db.add_fingerprints(fps) | ||
... dbs.append(db) | ||
>>> dbs[0][0] | ||
Fingerprint(indices=array([94, 97, ..., 988, 994]), level=-1, bits=4294967296, name=Mol0) | ||
>>> print(dbs[0]) | ||
FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5] | ||
>>> merge_db = concat(dbs) | ||
>>> print(merge_db) | ||
FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 50] | ||
|
||
Database Comparison | ||
------------------- | ||
|
||
Two databases may be compared using various metrics in | ||
`e3fp.fingerprint.metrics`. Additionally, all fingerprints in a database may be | ||
compared to each other simply by only providing a single database. | ||
See :ref:`Fingerprint Comparison` for more details. | ||
|
||
Performing Machine Learning on the Database | ||
------------------------------------------- | ||
|
||
The underlying sparse matrix may be passed directly to machine learning tools | ||
in any package that is compatible with SciPy sparse matrices, such as | ||
`scikit-learn <http://scikit-learn.org/>`_. | ||
|
||
>>> from sklearn.naive_bayes import BernoulliNB | ||
>>> clf = BernoulliNB() | ||
>>> clf.fit(db.array, ypred) # doctest: +SKIP | ||
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True) | ||
>>> clf.predict(db2.array) # doctest: +SKIP | ||
... |