# Other interesting features

Scikit-fingerprints has many other features, large and small. We cover two of them here: working with peptides and checking the applicability domain (AD).

## Peptides

Scikit-fingerprints enables the user to work with peptide data. In our [paper](https://arxiv.org/abs/2501.17901) we use scikit-fingerprints to show that molecular fingerprints perform very well on peptides.

To process peptide data, we first create a list of amino acid sequences read from files in the FASTA format. Each FASTA record consists of a header line and an amino acid sequence. We don't need the headers for processing.
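
As a minimal sketch of the format, the snippet below parses a small FASTA string and keeps only the sequences. The records and the helper name `read_sequences` are made up for illustration; unlike a parser that expects exactly two lines per record, it also joins sequences that are wrapped across several lines.

```python
def read_sequences(fasta_text):
    """Return the sequences from FASTA-formatted text, dropping the headers."""
    sequences = []
    current = []  # pieces of the sequence currently being read
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header starts a new record; store the previous sequence, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


# Hypothetical records, not taken from the Xu_AMP benchmark.
example = """>peptide_1
GLFDIVKKVV
>peptide_2
KWKLFKKI
EKVGQNIR
"""

sequences = read_sequences(example)
print(sequences)
```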

The code below creates lists of sequences and arrays of labels for the training and test sets of the Xu_AMP benchmark.

In [1]:
import numpy as np

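def extract_fasta(path):
    """Read sequences from a FASTA file with one header line and one sequence line per record."""
    with open(path) as f:
        lines = f.readlines()
    # Headers sit on lines 0, 2, 4, ...; sequences on lines 1, 3, 5, ...
    return [line.strip() for line in lines[1::2]]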

train_positive = extract_fasta("data/Xu_AMP/train_positive.fasta")
train_negative = extract_fasta("data/Xu_AMP/train_negative.fasta")
test_positive = extract_fasta("data/Xu_AMP/test_positive.fasta")
test_negative = extract_fasta("data/Xu_AMP/test_negative.fasta")

labels_train = np.array([1] * len(train_positive) + [0] * len(train_negative))
labels_test = np.array([1] * len(test_positive) + [0] * len(test_negative))

aminoseq_train = train_positive + train_negative
aminoseq_test = test_positive + test_negative

**Exercise 1**

Transform the amino acid sequences into molecules using `MolFromAminoseqTransformer`, just as we used `MolFromSmilesTransformer` before.

In [2]:
from skfp.preprocessing import MolFromAminoseqTransformer


mol_from_seq = MolFromAminoseqTransformer(n_jobs=-1)

mols_train = mol_from_seq.transform(aminoseq_train)
mols_test = mol_from_seq.transform(aminoseq_test)

**Exercise 2**

Let's assess the performance of very local molecular fingerprints on peptides.

- Iterate over the fingerprints, checking both the bit and the count variant of each.
- For each variant, transform the training and test molecules using the fingerprint.
- Train a `RandomForestClassifier` 5 times with different random states and compute AUROC on the test set for each repeat.
- Report the mean and standard deviation of AUROC for each variant of each fingerprint.

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from skfp.fingerprints import ECFPFingerprint, TopologicalTorsionFingerprint, RDKitFingerprint


fingerprints = [
    ECFPFingerprint,
    TopologicalTorsionFingerprint,
    RDKitFingerprint
]

for fp_class in fingerprints:
    for count in [False, True]:
        fp_transformer = fp_class(n_jobs=-1, count=count)

        fps_train = fp_transformer.transform(mols_train)
        fps_test = fp_transformer.transform(mols_test)

        auroc_scores = []
        for i in range(5):
            clf = RandomForestClassifier(n_jobs=-1, random_state=i)
            clf.fit(fps_train, labels_train)

            labels_proba = clf.predict_proba(fps_test)[:, 1]

            auroc = roc_auc_score(labels_test, labels_proba)
            auroc_scores.append(auroc)

        auroc_mean = np.mean(auroc_scores)
        auroc_std = np.std(auroc_scores)

        use_count = "count" if count else "bit"
        print(f"AUROC for {fp_class.__name__} in {use_count} variant: {auroc_mean:.2%} +- {auroc_std:.2%}")

AUROC for ECFPFingerprint in bit variant: 69.67% +- 0.15%
AUROC for ECFPFingerprint in count variant: 73.44% +- 0.21%
AUROC for TopologicalTorsionFingerprint in bit variant: 71.62% +- 0.17%
AUROC for TopologicalTorsionFingerprint in count variant: 71.45% +- 0.33%
AUROC for RDKitFingerprint in bit variant: 64.96% +- 0.12%
AUROC for RDKitFingerprint in count variant: 72.64% +- 0.18%
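
For intuition about the metric reported above: AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The brute-force sketch below computes it that way on made-up labels and scores. The function name `pairwise_auroc` is hypothetical, and in practice `roc_auc_score` is the right tool.

```python
def pairwise_auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.1]

# 8 of the 9 positive-negative pairs are ordered correctly.
print(f"{pairwise_auroc(toy_labels, toy_scores):.2%}")
```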


## Applicability domain checks

An applicability domain check tells us whether new molecules resemble the data a model was trained on. We can demonstrate it by comparing two distinct datasets.

We'll use the BBBP dataset from the MoleculeNet benchmark and the ApisTox dataset from [this paper](http://doi.org/10.1038/s41597-024-04232-w).

Since ApisTox contains pesticide molecules bioactive on honey bees, we expect it to differ significantly from BBBP, which contains molecules active on the human blood-brain barrier.

In [4]:
from skfp.datasets.moleculenet import load_bbbp
import pandas as pd

bbbp_smiles = load_bbbp()[0]

apistox_smiles = pd.read_csv("data/ApisTox/dataset_final.csv")["SMILES"]

In [5]:
from skfp.preprocessing import MolFromSmilesTransformer

mol_from_smiles = MolFromSmilesTransformer(n_jobs=-1)

bbbp_mols = mol_from_smiles.transform(bbbp_smiles)
apistox_mols = mol_from_smiles.transform(apistox_smiles)

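# ECFPFingerprint was imported in the peptide section above
ecfp = ECFPFingerprint(n_jobs=-1, count=True)

# Compute fingerprints from the molecules parsed above
bbbp_fps = ecfp.transform(bbbp_mols)
apistox_fps = ecfp.transform(apistox_mols)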

In [6]:
from skfp.distances import bulk_dice_count_similarity, bulk_tanimoto_count_similarity
import numpy as np

similarity_metrics = [bulk_dice_count_similarity, bulk_tanimoto_count_similarity]

for metric, name in zip(similarity_metrics, ["dice", "Tanimoto"]):
    bbbp_self_sim = metric(bbbp_fps)
    apistox_self_sim = metric(apistox_fps)
    cross_dataset_sim = metric(bbbp_fps, apistox_fps)

    bbbp_self_mean = bbbp_self_sim[np.triu_indices_from(bbbp_self_sim, k=1)].mean()
    apistox_self_mean = apistox_self_sim[np.triu_indices_from(apistox_self_sim, k=1)].mean()
    cross_dataset_mean = cross_dataset_sim.mean()

    print(f"average bbbp similarity for {name}                 : {bbbp_self_mean:.4}")
    print(f"average ApisTox similarity for {name}              : {apistox_self_mean:.4}")
    print(f"average similarity between two datasets for {name} : {cross_dataset_mean:.4}\n")

average bbbp similarity for dice                 : 0.3817
average ApisTox similarity for dice              : 0.2807
average similarity between two datasets for dice : 0.3031

average bbbp similarity for Tanimoto                 : 0.253
average ApisTox similarity for Tanimoto              : 0.1792
average similarity between two datasets for Tanimoto : 0.193
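
To help interpret these numbers, here is a minimal pure-Python sketch of Tanimoto and Dice similarity for count vectors, written as dot-product formulas. This is one common generalization to counts and is assumed, not confirmed, to match what `bulk_tanimoto_count_similarity` and `bulk_dice_count_similarity` compute. The function names and toy vectors are made up for illustration. The sketch also shows what taking the upper triangle with `k=1` achieves: each unordered pair is counted once and the trivial self-comparisons on the diagonal are left out.

```python
from itertools import combinations


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tanimoto_count(a, b):
    # <a, b> / (|a|^2 + |b|^2 - <a, b>)
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    # Convention chosen here: two all-zero vectors are treated as identical.
    return ab / denom if denom else 1.0


def dice_count(a, b):
    # 2 <a, b> / (|a|^2 + |b|^2)
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b)
    return 2 * ab / denom if denom else 1.0


def mean_pairwise(vectors, sim):
    # Average over unordered pairs i < j, i.e. the upper triangle without the diagonal.
    pairs = list(combinations(vectors, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Toy count fingerprints for illustration only.
toy_fps = [
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 3, 0, 1],
]

print(f"mean Tanimoto: {mean_pairwise(toy_fps, tanimoto_count):.4f}")
print(f"mean Dice:     {mean_pairwise(toy_fps, dice_count):.4f}")
```

Under these formulas Dice equals 2T / (1 + T) for a Tanimoto value T, so for non-negative counts it is never lower than Tanimoto on the same pair.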



In [8]:
from skfp.applicability_domain import BoundingBoxADChecker, ConvexHullADChecker

ad_checkers = [BoundingBoxADChecker, ConvexHullADChecker]

for ad_checker, name in zip(ad_checkers, ["Bounding box", "Convex hull"]):
    print(name)

    checker_bbbp = ad_checker(n_jobs=-1, verbose=True)
    checker_apistox = ad_checker(n_jobs=-1, verbose=True)

    print("Fitting ad for BBBP")
    checker_bbbp.fit(bbbp_fps)
    print("Predicting ApisTox on BBBP ad")
    apistox_in_bbbp = checker_bbbp.predict(apistox_fps).astype(int).mean()

    print(f"{apistox_in_bbbp:.2%} ApisTox molecules within BBBP ad")

    print("Fitting ad for ApisTox")
    checker_apistox.fit(apistox_fps)
    print("Predicting BBBP on ApisTox ad")
    bbbp_in_apistox = checker_apistox.predict(bbbp_fps).astype(int).mean()

    print(f"{bbbp_in_apistox:.2%} BBBP molecules within ApisTox ad\n")


Bounding box
Fitting ad for BBBP
Predicting ApisTox on BBBP ad
57.49% ApisTox molecules within BBBP ad
Fitting ad for ApisTox
Predicting BBBP on ApisTox ad
42.72% BBBP molecules within ApisTox ad

Convex hull
Fitting ad for BBBP
Predicting ApisTox on BBBP ad
0.29% ApisTox molecules within BBBP ad
Fitting ad for ApisTox
Predicting BBBP on ApisTox ad
0.10% BBBP molecules within ApisTox ad
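
The idea behind a bounding-box check can be sketched in a few lines: a sample counts as inside the domain when every feature falls within the minimum-maximum range seen in the training data. The plain-Python sketch below illustrates that principle only. It is not the `BoundingBoxADChecker` implementation, whose bounds and options may differ, and the helper names and toy data are hypothetical.

```python
def fit_bounding_box(train_rows):
    """Return per-feature minima and maxima of the training data."""
    columns = list(zip(*train_rows))
    lower = [min(col) for col in columns]
    upper = [max(col) for col in columns]
    return lower, upper


def in_bounding_box(rows, lower, upper):
    """Flag rows whose every feature lies within [lower, upper]."""
    return [
        all(lo <= x <= hi for x, lo, hi in zip(row, lower, upper))
        for row in rows
    ]


# Toy count vectors for illustration only.
train = [[0, 2, 1], [1, 0, 3], [2, 1, 0]]
new = [
    [1, 1, 1],  # inside the box
    [3, 1, 1],  # first feature exceeds the training maximum
    [0, 0, 0],  # inside the box, yet not a convex combination of the training rows
]

lower, upper = fit_bounding_box(train)
inside = in_bounding_box(new, lower, upper)
print(inside)
print(f"{sum(inside) / len(inside):.2%} of new samples inside the box")
```

The convex hull of a set of points always lies inside their min-max bounding box, so a hull check can never accept more samples than a min-max box check on the same data. The last toy sample shows the difference: it passes the box test although it lies outside the hull.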

