# ATOMICA-Ligand for annotation of HEM ligands to dark proteome small molecule binding sites

This Jupyter notebook shows how to run ATOMICA-Ligand to annotate small molecule binding sites in the dark proteome. To run it, you will need:
* checkpoints for ATOMICA-Ligand for HEM ligands, which can be downloaded from [Hugging Face](https://huggingface.co/ada-f/ATOMICA/tree/main/ATOMICA_checkpoints/ligand/small_molecules/HEM)
* processed dark proteome small molecule binding sites, which can be downloaded from [Harvard Dataverse](https://doi.org/10.7910/DVN/4DUBJX)

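For each small molecule binding site in the dark proteome dataset, this notebook outputs a predicted HEM score (the mean over the three model checkpoints) and a binary annotation obtained by comparing that score with the HEM threshold.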
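Before running the cells below, it can help to confirm that the downloaded files are where the notebook expects them. The following is a minimal sketch: `missing_files` is a hypothetical helper, not part of ATOMICA, and the paths are the same placeholders used in the cells below, so they need to be edited to match your download.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths (as strings) that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Placeholder locations copied from the cells below; edit them to match your setup.
checkpoint_dir = Path("/path/to/ATOMICA/checkpoints/ligand/HEM")
required = [
    checkpoint_dir / f"HEM_v{i}{suffix}"
    for i in (1, 2, 3)
    for suffix in ("_config.json", ".pt")
]
required.append(
    Path("/path/to/ATOMICA/data/dark_proteome/is_dark_90_plddt_PeSTo_80_small_molecule.jsonl.gz")
)
required.append(Path("ATOMICA_ligand_thresholds.json"))

# Report anything that still needs to be downloaded or moved.
for path in missing_files(required):
    print(f"missing: {path}")
```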

In [None]:
import numpy as np
import json
import os
import pandas as pd
import sys
import torch
from tqdm import tqdm

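# Add the directory two levels above the working directory to the import path
# so that the `data`, `models` and `trainers` packages can be imported.
current_directory = os.getcwd()
parent_directory = os.path.dirname(os.path.dirname(current_directory))
sys.path.insert(0, parent_directory)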

from data.dataset import PDBDataset
from models.classifier_model import ClassifierModel
from trainers.abs_trainer import Trainer

In [4]:
# Load the three HEM checkpoints (v1, v2, v3) that are averaged at inference time.
checkpoint_dir = "/path/to/ATOMICA/checkpoints/ligand/HEM"

models = [
    ClassifierModel.load_from_config_and_weights(
        f"{checkpoint_dir}/HEM_v{i}_config.json",
        f"{checkpoint_dir}/HEM_v{i}.pt",
    )
    for i in (1, 2, 3)
]



In [None]:
dataset = PDBDataset("/path/to/ATOMICA/data/dark_proteome/is_dark_90_plddt_PeSTo_80_small_molecule.jsonl.gz")

In [11]:
batch_size = 16

# Put every model in evaluation mode on the GPU.
for model in models:
    model.eval()
    model.to('cuda')

predictions = []
# range() has a known length, so tqdm reports the correct number of batches,
# including a final partial batch.
for i in tqdm(range(0, len(dataset), batch_size)):
    batch = PDBDataset.collate_fn([dataset[j] for j in range(i, min(i + batch_size, len(dataset)))])
    batch = Trainer.to_device(batch, 'cuda')

    # Score the batch with each model without tracking gradients,
    # then average the scores across models.
    with torch.no_grad():
        batch_predictions = [model.infer(batch).detach().cpu() for model in models]
    batch_predictions = torch.mean(torch.stack(batch_predictions), dim=0).numpy()
    predictions.append(batch_predictions)
predictions = np.concatenate(predictions, axis=0)

61it [00:48,  1.25it/s]                        
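The inference cell above walks the dataset in slices of `batch_size` and averages the scores of the models for each slice. The sketch below reproduces that control flow in plain Python so it can be run without a GPU or the ATOMICA checkpoints. `batch_ranges` and `ensemble_mean` are hypothetical helpers, and the scores are toy placeholders rather than model outputs.

```python
import math
from statistics import fmean


def batch_ranges(n_items, batch_size):
    """Return (start, stop) index pairs that cover n_items in order."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


def ensemble_mean(per_model_scores):
    """Average scores position-wise across models.

    per_model_scores is a list of equal-length lists, one list per model.
    """
    return [fmean(site_scores) for site_scores in zip(*per_model_scores)]


# Toy placeholder scores: three stand-in "models", five binding sites.
toy_scores = [
    [0.9, 0.2, 0.6, 0.1, 0.8],
    [0.7, 0.4, 0.6, 0.3, 0.8],
    [0.8, 0.3, 0.6, 0.2, 0.8],
]
n_sites = len(toy_scores[0])

averaged = []
for start, stop in batch_ranges(n_sites, batch_size=2):
    # Slice the same sites out of every model's scores, then average them.
    batch = [scores[start:stop] for scores in toy_scores]
    averaged.extend(ensemble_mean(batch))

# The number of batches is ceil(n_sites / batch_size), not n_sites // batch_size.
print(len(batch_ranges(n_sites, 2)), math.ceil(n_sites / 2))
print([round(score, 3) for score in averaged])
```

Because the last slice may be shorter than `batch_size`, the batch count is the ceiling of the division; floor division undercounts by one whenever the dataset size is not a multiple of `batch_size`.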


In [23]:
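# Load the score thresholds; the "HEM" entry is the cutoff used for HEM annotation.
with open("ATOMICA_ligand_thresholds.json", "r") as f:
    thresholds = json.load(f)

# A site is annotated as HEM-binding when its averaged score exceeds the threshold.
predictions_df = pd.DataFrame(
    {"uniprot_id": dataset.indexes, "HEM_score": predictions, "HEM_annotation": predictions > thresholds["HEM"]}
).sort_values("HEM_score", ascending=False)
# Show only the annotated sites, highest score first.
predictions_df[predictions_df["HEM_annotation"]]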

|     | uniprot_id | HEM_score | HEM_annotation |
|-----|------------|-----------|----------------|
| 184 | A0A3Q9I286 | 0.998523 | True |
| 444 | A0A202DWN7 | 0.998288 | True |
| 518 | A0A1G8Z1U8 | 0.997995 | True |
| 301 | U2G347 | 0.996366 | True |
| 63 | A0A1C6AUF5 | 0.996321 | True |
| 347 | A0A1H3K9Y7 | 0.995171 | True |
| 255 | A0A0K9IDB2 | 0.994648 | True |
| 32 | A0A2W5Y9V8 | 0.994561 | True |
| 611 | A0A7J3Z9T9 | 0.98958 | True |
| 294 | A0A1L3RA87 | 0.988407 | True |
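The annotation step above reduces to a strict comparison of each averaged score against a single cutoff, followed by a descending sort. The sketch below shows that logic in plain Python. It assumes the thresholds file maps a ligand name to one number, which is what `thresholds["HEM"]` suggests. The value `0.5`, the site identifiers and the scores are illustrative placeholders, not the published threshold or real predictions, and `annotate` is a hypothetical helper.

```python
import json

# Stand-in for the contents of ATOMICA_ligand_thresholds.json.
# 0.5 is a placeholder, NOT the real HEM cutoff.
thresholds = json.loads('{"HEM": 0.5}')


def annotate(ids, scores, threshold):
    """Pair each id with its score and a strict score > threshold flag,
    sorted from highest to lowest score."""
    rows = [
        {"uniprot_id": i, "HEM_score": s, "HEM_annotation": s > threshold}
        for i, s in zip(ids, scores)
    ]
    rows.sort(key=lambda row: row["HEM_score"], reverse=True)
    return rows


# Toy identifiers and scores, for illustration only.
toy_ids = ["site_a", "site_b", "site_c"]
toy_scores = [0.12, 0.97, 0.50]

rows = annotate(toy_ids, toy_scores, thresholds["HEM"])
hits = [row["uniprot_id"] for row in rows if row["HEM_annotation"]]
print(hits)
```

Note that the comparison is strict, matching the `>` in the cell above: a score exactly equal to the threshold is not annotated.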
