# Other interesting features

Scikit-fingerprints offers many other features, large and small. Two of them are covered here: working with peptides and checking the applicability domain (AD).

## Peptides

Scikit-fingerprints supports working with peptide data. In our [paper]() we use it to show that molecular fingerprints perform very well on peptides.

To process peptide data, we first create a list of amino acid sequences read from files in the FASTA format. Each FASTA record consists of a header line, which starts with `>`, and an amino acid sequence. We don't need the headers for processing.
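As a minimal sketch of the format, the snippet below parses a small in-memory FASTA string into (header, sequence) pairs and then keeps only the sequences. The headers and sequences are made up for illustration, and `parse_fasta` is a hypothetical helper, not part of scikit-fingerprints. Unlike a strict line-by-line reader, it also joins sequences that are wrapped across several lines.

```python
# Toy FASTA content; headers and sequences are invented for illustration only.
fasta_text = """>peptide_1
GLFDIVKKVVGALGSL
>peptide_2
KWKLFKKIEK
VGQNIRDG
"""


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


records = parse_fasta(fasta_text)
sequences = [seq for _, seq in records]  # headers are dropped
print(sequences)
```

Only the `sequences` list would be passed on to further processing; the headers are discarded.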

The code below creates lists of sequences and arrays of labels for the training and test sets of the Xu_AMP benchmark. Positive examples get label 1 and negative examples get label 0.

In [1]:
import numpy as np

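def extract_fasta(path):
    # Collect sequences only; FASTA header lines start with ">" and are skipped.
    # Assumes each sequence occupies a single line, as in the Xu_AMP files.
    aminoseq = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith(">"):
                aminoseq.append(line)
    return aminoseq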

train_positive = extract_fasta("data/Xu_AMP/train_positive.fasta")
train_negative = extract_fasta("data/Xu_AMP/train_negative.fasta")
test_positive = extract_fasta("data/Xu_AMP/test_positive.fasta")
test_negative = extract_fasta("data/Xu_AMP/test_negative.fasta")

labels_train = np.array([1] * len(train_positive) + [0] * len(train_negative))
labels_test = np.array([1] * len(test_positive) + [0] * len(test_negative))

aminoseq_train = train_positive + train_negative
aminoseq_test = test_positive + test_negative

In [2]:
from skfp.preprocessing import MolFromAminoseqTransformer


# Convert amino acid sequences into RDKit Mol objects, using all CPU cores
mol_from_seq = MolFromAminoseqTransformer(n_jobs=-1)

mols_train = mol_from_seq.transform(aminoseq_train)
mols_test = mol_from_seq.transform(aminoseq_test)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from skfp.fingerprints import ECFPFingerprint, TopologicalTorsionFingerprint, RDKitFingerprint


fingerprints = [
    ECFPFingerprint,
    TopologicalTorsionFingerprint,
    RDKitFingerprint
]

for fp_class in fingerprints:
    # Evaluate both the binary (bit) and the count variant of each fingerprint
    for count in [False, True]:
        fp_transformer = fp_class(n_jobs=-1, count=count)

        fps_train = fp_transformer.transform(mols_train)
        fps_test = fp_transformer.transform(mols_test)

        # Repeat training with 5 random seeds to estimate the variance
        auroc_scores = []
        for i in range(5):
            clf = RandomForestClassifier(n_jobs=-1, random_state=i)
            clf.fit(fps_train, labels_train)

            # Predicted probability of the positive class
            labels_proba = clf.predict_proba(fps_test)[:, 1]

            auroc = roc_auc_score(labels_test, labels_proba)
            auroc_scores.append(auroc)

        auroc_mean = np.mean(auroc_scores)
        auroc_std = np.std(auroc_scores)

        use_count = "count" if count else "bit"
        print(f"AUROC for {fp_class.__name__} in {use_count} variant: {auroc_mean:.2%} +- {auroc_std:.2%}")

AUROC for ECFPFingerprint in bit variant: 69.67% +- 0.15%
AUROC for ECFPFingerprint in count variant: 73.44% +- 0.21%
AUROC for TopologicalTorsionFingerprint in bit variant: 71.62% +- 0.17%
AUROC for TopologicalTorsionFingerprint in count variant: 71.45% +- 0.33%
AUROC for RDKitFingerprint in bit variant: 64.96% +- 0.12%
AUROC for RDKitFingerprint in count variant: 72.64% +- 0.18%
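
The two variants compared above differ in what each vector position stores: the bit variant records only whether a substructure is present, while the count variant records how many times it occurs. The toy sketch below illustrates that difference. It is not how scikit-fingerprints computes fingerprints: `fold_features`, the integer substructure IDs, and the modulo folding are all assumptions made for illustration.

```python
from collections import Counter


def fold_features(feature_ids, fp_size=8, count=False):
    """Fold integer substructure IDs into a fixed-length vector (toy version)."""
    # Map every ID to a position in the vector and tally the occurrences
    counts = Counter(fid % fp_size for fid in feature_ids)
    if count:
        # Count variant: number of occurrences at each position
        return [counts.get(i, 0) for i in range(fp_size)]
    # Bit variant: 1 if the position was hit at least once, else 0
    return [int(i in counts) for i in range(fp_size)]


# Hypothetical substructure IDs; repeated IDs mimic a repeated substructure
feature_ids = [3, 3, 3, 10, 17, 17]

bit_fp = fold_features(feature_ids, count=False)
count_fp = fold_features(feature_ids, count=True)

print(bit_fp)    # [0, 1, 1, 1, 0, 0, 0, 0]
print(count_fp)  # [0, 2, 1, 3, 0, 0, 0, 0]
```

In this toy example both vectors are nonzero at the same positions, but only the count vector preserves how often each substructure repeats.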


## Applicability domain checks