The most efficient way to store and interact with fingerprints is through the e3fp.fingerprint.db.FingerprintDatabase class. This class wraps a matrix with sparse rows (scipy.sparse.csr_matrix), where each row is a fingerprint. This enables rapid I/O of the database while also minimizing the memory footprint. Accessing the underlying sparse representation with the .FingerprintDatabase.array
attribute is convenient for machine learning purposes, while the database class itself provides several useful functions.
Note
We strongly recommend upgrading to at least SciPy v1.0.0 when working with large fingerprint databases, as old versions are much slower and have several bugs for database loading.
See the full e3fp.fingerprint.db.FingerprintDatabase documentation for a description of basic database usage, attributes, and methods. Below, several additional use cases are documented.
Due to the sparse representation of the underlying data structure, an un-folded database, a database with unfolded fingerprints does not use significantly more disk space than a database with folded fingerprints. However, it is usually necessary to fold fingerprints for machine learning tasks. The :py.FingerprintDatabase
does this very quickly.
import numpy as np np.random.seed(3)
>>> from e3fp.fingerprint.db import FingerprintDatabase >>> from e3fp.fingerprint.fprint import Fingerprint >>> import numpy as np >>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB") >>> print(db) FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0] >>> on_inds = [np.random.uniform(0, 2*32, size=30) for i in range(5)] >>> fps = [Fingerprint(x, bits=2*32) for x in on_inds] >>> db.add_fingerprints(fps) >>> print(db) FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5] >>> db.get_density() 6.984919309616089e-09 >>> fold_db = db.fold(1024) >>> print(fold_db) FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 5] >>> fold_db.get_density() 0.0287109375
A database can be converted to a different fingerprint type:
>>> from e3fp.fingerprint.fprint import CountFingerprint >>> count_db = db.as_type(CountFingerprint) >>> print(count_db) FingerprintDatabase[name: TestDB, fp_type: CountFingerprint, level: -1, bits: 4294967296, fp_num: 5] >>> count_db[0] CountFingerprint(counts={2977004690: 1, ..., 3041471738: 1}, level=-1, bits=4294967296, name=None)
The e3fp.fingerprint.db.concat method allows efficient joining of multiple databases.
>>> from e3fp.fingerprint.db import concat >>> dbs = [] >>> for i in range(10): ... db = FingerprintDatabase(fp_type=Fingerprint) ... on_inds = [np.random.uniform(0, 1024, size=30) for j in range(5)] ... fps = [Fingerprint(x, bits=2**32, name="Mol{}".format(i)) for x in on_inds] ... db.add_fingerprints(fps) ... dbs.append(db) >>> dbs[0][0] Fingerprint(indices=array([94, 97, ..., 988, 994]), level=-1, bits=4294967296, name=Mol0) >>> print(dbs[0]) FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5] >>> merge_db = concat(dbs) >>> print(merge_db) FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 50]
Two databases may be compared using various metrics in e3fp.fingerprint.metrics. Additionally, all fingerprints in a database may be compared to each other simply by only providing a single database. See Fingerprint Comparison
for more details.
The underlying sparse matrix may be passed directly to machine learning tools in any package that is compatible with SciPy sparse matrices, such as scikit-learn.
>>> from sklearn.naive_bayes import BernoulliNB >>> clf = BernoulliNB() >>> clf.fit(db.array, ypred) # doctest: +SKIP BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True) >>> clf.predict(db2.array) # doctest: +SKIP ...