# Paper Analyses

Replication of the analyses in the paper: TBC

In [9]:
import os
import sys

sys.path.append(os.path.join("..", "..", "DeepMet", "src"))
from workflow.training import train_likeness_scorer
from utils.feature_processing import get_fingerprints_from_meta


Compounds were extracted from the HMDB and ZINC12 databases, subject to the following constraints:
- Exact mass filter: 100 Da < exact mass < 800 Da
- Other things

The entire set of compounds passing these filters in HMDB was retained, while a random sample of 20,000 compounds was taken from ZINC12. The SMILES for these compounds are available in the `data/test_set` folder.
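The filtering and sampling described above can be sketched as follows. This is a minimal illustration, not the script used to build the data set: it assumes the mass window is read as strictly between 100 Da and 800 Da, and the record layout, the field name `exact_mass`, and both helper functions are hypothetical.

```python
import random


def passes_mass_filter(exact_mass, low=100.0, high=800.0):
    """Return True if the exact mass (in Da) lies strictly between low and high."""
    return low < exact_mass < high


def sample_compounds(compounds, n, seed=1):
    """Return up to n compounds drawn at random without replacement."""
    if len(compounds) <= n:
        return list(compounds)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(compounds, n)


# Hypothetical records; real metadata would come from the database exports.
compounds = [
    {"id": "cpd_a", "exact_mass": 89.05},
    {"id": "cpd_b", "exact_mass": 180.06},
    {"id": "cpd_c", "exact_mass": 386.35},
    {"id": "cpd_d", "exact_mass": 1201.84},
]

filtered = [c for c in compounds if passes_mass_filter(c["exact_mass"])]
subset = sample_compounds(filtered, n=1, seed=1)
```

For the ZINC12 set, `n` would be 20,000; for HMDB, every compound in `filtered` would be kept.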

In [10]:
data_path = "../data/test_set/"

# Locations of the "normal" (HMDB) and "non-normal" (ZINC12) SMILES metadata files
normal_meta_path = os.path.join(data_path, "hmdb_meta.csv")
non_normal_meta_path = os.path.join(data_path, "zinc_meta.csv")

# Path to write results
results_path = "../paper_results"

# Create the results folder if it does not already exist
os.makedirs(results_path, exist_ok=True)

While SMILES are provided in the data files, they are not used directly as input to the model.
If pre-calculated fingerprints are not given to the `train_likeness_scorer` function, it will
calculate molecular fingerprints from the SMILES given as input. For the sake of this vignette,
the following chunk shows how to calculate them separately.

This is a particularly time-consuming step, so it is recommended to supply the pre-calculated fingerprint
matrix for training.
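One way to avoid repeating the slow step is to compute the fingerprint file only when it is missing. The sketch below is an assumption about how this could be arranged, not part of DeepMet: `get_or_compute` is a hypothetical helper, and `fake_fingerprints` stands in for the real calculation so that the example runs without the package.

```python
import os
import tempfile


def get_or_compute(output_path, compute_fn):
    """Return output_path, calling compute_fn(output_path) only if the file is missing."""
    if not os.path.exists(output_path):
        compute_fn(output_path)
    return output_path


calls = []


def fake_fingerprints(path):
    """Stand-in for the expensive fingerprint calculation (hypothetical)."""
    calls.append(path)
    with open(path, "w") as handle:
        handle.write("id,bit_0,bit_1\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "normal_fingerprints.csv")
    first = get_or_compute(target, fake_fingerprints)   # computes and writes the file
    second = get_or_compute(target, fake_fingerprints)  # reuses the existing file
```

In this notebook, the stand-in could be replaced by a wrapper such as `lambda p: get_fingerprints_from_meta(normal_meta_path, p)`, following the call shown in the next chunk.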

In [11]:
# Uncomment to (re)calculate the fingerprints from the SMILES; this step is slow.
# normal_fingerprints_path = get_fingerprints_from_meta(normal_meta_path, os.path.join(results_path, "normal_fingerprints.csv"))
# non_normal_fingerprints_path = get_fingerprints_from_meta(non_normal_meta_path, os.path.join(results_path, "non_normal_fingerprints.csv"))

# Otherwise, point to the pre-calculated fingerprint matrices.
normal_fingerprints_path = os.path.join(results_path, "normal_fingerprints.csv")
non_normal_fingerprints_path = os.path.join(results_path, "non_normal_fingerprints.csv")

While we have now generated the molecular fingerprints, these include many poorly balanced and
redundant features. The `train_likeness_scorer` function first calls `select_features` to greatly
reduce the dimensionality of the inputs.
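To make "poorly balanced" and "redundant" concrete, here is a minimal sketch of one possible filter over a binary fingerprint matrix. It is not the implementation of `select_features`; the function name, the `min_fraction` threshold, and the exact-duplicate criterion are all assumptions chosen for illustration.

```python
def select_informative_columns(matrix, min_fraction=0.1):
    """Return indices of columns that are neither poorly balanced nor exact duplicates."""
    n_rows = len(matrix)
    kept = []
    seen = set()
    for col in range(len(matrix[0])):
        values = tuple(row[col] for row in matrix)
        fraction_on = sum(values) / n_rows
        # Poorly balanced: the bit is set (or unset) in almost every compound.
        if fraction_on < min_fraction or fraction_on > 1 - min_fraction:
            continue
        # Redundant: an identical column has already been kept.
        if values in seen:
            continue
        seen.add(values)
        kept.append(col)
    return kept


# Toy matrix: rows are compounds, columns are fingerprint bits.
fingerprints = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
]

kept_columns = select_informative_columns(fingerprints, min_fraction=0.1)
reduced = [[row[col] for col in kept_columns] for row in fingerprints]
```

Here column 0 is always set, column 4 is never set, and column 3 duplicates column 2, so only columns 1 and 2 survive.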

Below, we re-train the model created in the paper.
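The training call uses `objective="soft-boundary"` with `nu=0.1`, which selects the soft-boundary variant of Deep SVDD. As a rough illustration of that objective as published for Deep SVDD (not DeepMet's own code), the loss adds the squared radius of the hypersphere to a penalty, scaled by `1/nu`, for points that fall outside it. The function name and the toy distances below are assumptions, and weight decay is omitted.

```python
def soft_boundary_loss(sq_distances, radius, nu):
    """Soft-boundary Deep SVDD loss: R^2 + (1/nu) * mean(max(0, d - R^2)).

    sq_distances holds each sample's squared distance to the hypersphere centre.
    """
    radius_sq = radius ** 2
    penalties = [max(0.0, d - radius_sq) for d in sq_distances]
    return radius_sq + (1.0 / nu) * sum(penalties) / len(penalties)


# Toy squared distances; in practice these come from the network's embeddings.
sq_distances = [0.5, 1.0, 3.0, 2.0]
loss = soft_boundary_loss(sq_distances, radius=1.0, nu=0.5)
```

A smaller `nu` makes each point outside the sphere more costly, so fewer training compounds are allowed to fall outside the boundary.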

In [None]:
# Train the model - feature selection is done within the function
deep_met_model = train_likeness_scorer(
    normal_meta_path=normal_meta_path,
    results_path=results_path,
    load_config=False,
    non_normal_meta_path=non_normal_meta_path,
    normal_fingerprints_path=normal_fingerprints_path,
    non_normal_fingerprints_path=non_normal_fingerprints_path,
    net_name="cocrystal_transformer",
    objective="soft-boundary",
    nu=0.1,
    rep_dim=200,
    device="cpu",
    seed=1,
    optimizer_name="amsgrad",
    lr=0.000100095,
    n_epochs=20,
    lr_milestones=tuple(),
    batch_size=2000,
    weight_decay=1e-5,
    pretrain=False,
    validation_split=0.8,
    test_split=0.9
)

INFO:root:Log file is ../paper_results/log.txt.
INFO:root:Export path is ../paper_results.
INFO:root:Network: cocrystal_transformer
INFO:root:The filtered normal fingerprint matrix path is ../paper_results\normal_fingerprints_processed.csv.
INFO:root:The filtered normal meta is ../data/test_set/hmdb_meta.csv.
INFO:root:The filtered non-normal fingerprint matrix path is ../paper_results\non_normal_fingerprints_processed.csv.
INFO:root:The filtered non-normal meta is ../data/test_set/zinc_meta.csv.
INFO:root:Deep SVDD objective: soft-boundary
INFO:root:Nu-parameter: 0.10
INFO:root:Computation device: cpu
