# Usage of DeepEI

## Preparing the dataset for training the fingerprint prediction model

This repository does not include the data exported from NIST 2017 that was used in the paper, so users need to prepare the data themselves. Users without access to NIST 2017 can use MoNA as an alternative. Here we assume that a list of SMILES and the corresponding spectra is available.

In [10]:
import numpy as np
from scipy.sparse import csr_matrix, save_npz
from DeepEI.utils import ms2vec, fp2vec, get_cdk_fingerprints

In [11]:
all_smiles = ["CCOP(C)(=O)OP(C)(=S)OCC", "C[Si](C)(C)NC(=O)N1c2ccccc2CC(O[Si](C)(C)C)c2ccccc21", "O=C(C(Br)C(Br)c1ccccc1)C(Br)C(Br)c1ccccc1"]
all_spectra = [[[55, 70, 145, 255], [23, 999, 344, 77]], [[58, 75, 233, 259], [23, 566, 304, 999]], [[15, 88, 170, 335], [15, 99, 999, 664]]] # not real spectra
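Readers exporting from MoNA or NIST need to bring each record into the `[[m/z values], [intensities]]` layout used above. The following is a minimal sketch, assuming an MSP-style text record in which a `Num Peaks:` line is followed by `m/z intensity` pairs (one per line, or several per line separated by `;`). The function name `parse_msp_peaks` and the sample record are hypothetical and not part of DeepEI.

```python
def parse_msp_peaks(record):
    """Return [[m/z values], [intensities]] from one MSP-style record (assumed format)."""
    mzs, intensities = [], []
    in_peaks = False
    for line in record.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.lower().startswith("num peaks:"):
            in_peaks = True  # peak lines follow this header
            continue
        if in_peaks:
            # a line may hold one pair or several pairs separated by ';'
            for chunk in line.split(";"):
                tokens = chunk.split()
                if len(tokens) >= 2:
                    mzs.append(float(tokens[0]))
                    intensities.append(float(tokens[1]))
    return [mzs, intensities]


# made-up record for illustration only, not a real spectrum
example_record = """Name: example
Num Peaks: 4
55 23
70 999
145 344
255 77
"""

print(parse_msp_peaks(example_record))
```

Real exports can carry extra fields or peak annotations, so the parser may need adjusting for a particular file.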

In [12]:
# calculate fingerprints
fp_vecs = np.array([get_cdk_fingerprints(s) for s in all_smiles])

# translate peaks into vectors
peak_vecs = np.array([ms2vec(s[0], s[1]) for s in all_spectra])
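To make the peak-to-vector step concrete, here is a minimal sketch of one way such a conversion can work: each peak is placed at the index of its rounded m/z, and the vector is scaled by its largest intensity. This is an illustration under stated assumptions, not the implementation of `ms2vec`; the name `peaks_to_vector`, the `max_mz` length, and the normalization are all hypothetical.

```python
def peaks_to_vector(mzs, intensities, max_mz=300):
    """Bin peaks into a fixed-length list indexed by rounded m/z (hypothetical sketch)."""
    vec = [0.0] * max_mz
    for mz, inten in zip(mzs, intensities):
        idx = int(round(mz))
        if 0 <= idx < max_mz:  # peaks outside the vector are dropped
            # if two peaks round to the same bin, keep the stronger one
            vec[idx] = max(vec[idx], float(inten))
    top = max(vec)
    if top > 0:
        vec = [v / top for v in vec]  # scale so the base peak equals 1.0
    return vec


# same made-up peaks as the first example spectrum
vec = peaks_to_vector([55, 70, 145, 255], [23, 999, 344, 77])
print(len(vec), vec[70], round(vec[145], 3))
```

A fixed-length vector of this kind gives every spectrum the same shape, which is what allows the spectra to be stacked into one matrix.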

In [13]:
# convert to sparse matrices and save
fp_vecs = csr_matrix(fp_vecs)
peak_vecs = csr_matrix(peak_vecs)
# uncomment to write the files that the training section loads
# save_npz('DeepEI/data/peakvec.npz', peak_vecs)
# save_npz('DeepEI/data/fingerprints.npz', fp_vecs)

## Training the model

With the dataset prepared, users can train the MLP models for fingerprint prediction. The model class is imported directly from *Fingerprint/mlp.py*; to tune the model's parameters, edit that file.

In [14]:
import numpy as np
from tqdm import tqdm
from scipy.sparse import csr_matrix, load_npz
from Fingerprint.mlp import MLP

In [15]:
spec = load_npz('DeepEI/data/peakvec.npz')
fps = load_npz('DeepEI/data/fingerprints.npz')
spec = spec.todense()
fps = csr_matrix(fps)

In [None]:
# train one model per fingerprint bit; this is time-consuming
for i in tqdm(range(fps.shape[1])):
    y = fps[:,i].todense()
    y = np.squeeze(np.asarray(y))

    # skip imbalanced bits: train only when 0.1 <= fraction of positives <= 0.9
    fr = np.sum(y) / len(y)
    if (fr < 0.1) or (fr > 0.9):
        continue
    # two-column target: column 0 is the bit, column 1 its complement
    Y = np.vstack((y, (1-y))).transpose()

    mlp = MLP(spec, Y)
    mlp.train()
    mlp_res = mlp.test()

    # append the evaluation results; the with-block closes the file each time
    with open('Fingerprint/results/mlp_result.txt', 'a+') as mlp_result:
        mlp_result.write("\t".join([str(i)] + [str(j) for j in mlp_res]))
        mlp_result.write("\n")
    mlp.save('Fingerprint/mlp_models/{}.h5'.format(i))
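The filter in the loop above skips fingerprint bits that are set in fewer than 10% or more than 90% of the compounds, presumably because such columns are too imbalanced to train on. A small pure-Python sketch of the same check and of the two-column target follows. The helper names `is_trainable` and `two_column_target` are hypothetical, and the real cell works on NumPy arrays rather than lists.

```python
def is_trainable(bits, low=0.1, high=0.9):
    """True when the fraction of set bits lies within [low, high]."""
    fr = sum(bits) / len(bits)
    return not (fr < low or fr > high)


def two_column_target(bits):
    """Pair each bit with its complement, mirroring np.vstack((y, 1 - y)).transpose()."""
    return [[b, 1 - b] for b in bits]


balanced = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]  # 40% of compounds have the bit set
rare = [0] * 19 + [1]                      # 5% of compounds have the bit set

print(is_trainable(balanced), is_trainable(rare))
print(two_column_target([1, 0, 1]))
```

Because skipped bits never get a model or a line in the results file, later steps have to select fingerprint columns by the ids that were actually written.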

## Predicting the fingerprints via the spectra

With the trained models, fingerprints can be predicted for unknown spectra.

In [21]:
import numpy as np
import pandas as pd
from DeepEI.utils import ms2vec
from DeepEI.predict import predict_fingerprint

In [22]:
mlp = pd.read_csv('Fingerprint/results/mlp_result.txt', sep='\t', header=None)
mlp.columns = ['id', 'accuracy', 'precision', 'recall', 'f1']
fpkeep = mlp['id'][np.where(mlp['f1'] > 0.5)[0]] # keep only the models with F1 > 0.5
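The results file written during training holds one tab-separated line per trained model. As a sketch of the selection step without pandas, the snippet below parses such lines and keeps the ids whose F1 exceeds a threshold. It assumes the column order `id, accuracy, precision, recall, f1` named above; the function `select_models` is hypothetical and the sample rows are invented for illustration.

```python
import csv
import io


def select_models(tsv_text, min_f1=0.5):
    """Return the ids of models whose F1 is above min_f1 (assumed column order)."""
    keep = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        model_id, f1 = int(row[0]), float(row[4])
        if f1 > min_f1:
            keep.append(model_id)
    return keep


# invented rows: id, accuracy, precision, recall, f1
sample = (
    "3\t0.91\t0.80\t0.75\t0.77\n"
    "7\t0.88\t0.40\t0.35\t0.37\n"
    "12\t0.95\t0.90\t0.85\t0.87\n"
)

print(select_models(sample))
```

The ids are fingerprint column indices, so the same list can later be used to select the matching columns of the candidate fingerprints.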

In [24]:
unknown_spectra = [[[55, 70, 145, 255], [23, 999, 344, 77]], [[58, 75, 233, 259], [23, 566, 304, 999]], [[15, 88, 170, 335], [15, 99, 999, 664]]] # not real spectra
unknown_peak_vecs = np.array([ms2vec(s[0], s[1]) for s in unknown_spectra])
pred_fps = predict_fingerprint(unknown_peak_vecs, fpkeep) 

100%|████████████████████████████████████████████████████████████████████████████████| 633/633 [03:54<00:00,  2.70it/s]


## Scoring the candidates

Here the user supplies the candidate structures for the unknown compounds. DeepEI scores the candidates by fingerprint similarity.
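A common similarity measure for binary fingerprints is the Jaccard (Tanimoto) coefficient: the number of bits set in both vectors divided by the number set in either. The sketch below illustrates that measure on toy vectors. That a measure of this kind applies here is an assumption, and the function `jaccard_similarity` is hypothetical rather than the implementation of `get_fp_score`.

```python
def jaccard_similarity(fp_a, fp_b):
    """Jaccard (Tanimoto) coefficient of two equal-length binary vectors."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # convention chosen for this sketch: two all-zero vectors score 0.0
    return both / either if either else 0.0


# toy vectors for illustration only
predicted = [1, 0, 1, 1, 0, 0]
candidates = [
    [1, 0, 1, 1, 0, 0],  # identical to the prediction
    [1, 1, 0, 1, 0, 0],  # shares two of four set bits
    [0, 0, 0, 0, 1, 1],  # shares none
]

print([jaccard_similarity(predicted, c) for c in candidates])
```

Scores range from 0 (no shared bits) to 1 (identical fingerprints), so a higher score marks a candidate closer to the predicted fingerprint.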

In [30]:
from DeepEI.utils import get_fp_score

In [31]:
candidate_smiles = ["CCOP(C)(=O)OP(C)(=S)OCC", "C[Si](C)(C)NC(=O)N1c2ccccc2CC(O[Si](C)(C)C)c2ccccc21", "O=C(C(Br)C(Br)c1ccccc1)C(Br)C(Br)c1ccccc1"]
candidate_fps = np.array([get_cdk_fingerprints(s) for s in candidate_smiles])
candidate_fps = candidate_fps[:, fpkeep] # keep only the fingerprint bits that have a prediction model

In [32]:
pred_fp = pred_fps[0] # choose the first unknown compound
scores = get_fp_score(pred_fp, candidate_fps)

In [33]:
print(scores)

[0.71406003 0.5971564  0.66824645]
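To turn the scores into a ranked candidate list, pair each score with its SMILES and sort in descending order. The sketch below reuses the scores printed above and the candidate list from this section; it is plain Python and assumes nothing about DeepEI beyond that output.

```python
candidate_smiles = [
    "CCOP(C)(=O)OP(C)(=S)OCC",
    "C[Si](C)(C)NC(=O)N1c2ccccc2CC(O[Si](C)(C)C)c2ccccc21",
    "O=C(C(Br)C(Br)c1ccccc1)C(Br)C(Br)c1ccccc1",
]
scores = [0.71406003, 0.5971564, 0.66824645]  # copied from the printed output

# highest score first
ranked = sorted(zip(candidate_smiles, scores), key=lambda pair: pair[1], reverse=True)

for rank, (smiles, score) in enumerate(ranked, start=1):
    print(rank, round(score, 3), smiles)
```

With these values the first candidate ranks highest, followed by the third and then the second.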
