### Explore AI4LifeScience Tools in OpenBioMed

OpenBioMed implements a suite of AI tools for accelerating life science research, including:
- molecular property prediction
- molecule editing
- text-based de novo molecule generation
- protein function prediction
- protein folding
- de novo protein generation
- protein mutation explanation & engineering
- protein-molecule docking
- structure-based drug design

Feel free to [download](https://cloud.tsinghua.edu.cn/d/5d08f4bc502848dc83bd/) our trained models, put them under `checkpoints/server`, and explore their applications with your own data!

In [1]:
# Change working directory
import os
import sys
parent = os.path.dirname(os.path.abspath(''))
print(parent)
sys.path.append(parent)
os.chdir(parent)

import logging
logging.basicConfig(level=logging.ERROR)

/AIRvePFS/dair/luoyz-data/projects/OpenBioMed/OpenBioMed_arch


In OpenBioMed, `InferencePipeline` provides a unified interface for deploying ML-based models and running predictions. To construct a pipeline, you only need to specify the task, the model, the path to the trained checkpoint, and the device on which to deploy the model.
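The device can also be chosen at runtime instead of being hard-coded. The following is a minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical helper and not part of OpenBioMed. It returns a string that can be passed as the `device` argument.

```python
def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is installed in this environment
    except ImportError:
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result is either the preferred CUDA device or `"cpu"`, so the same notebook can run on machines with and without a GPU.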

You can call `pipeline.print_usage()` to see the inputs and outputs of the model. To construct appropriate molecule and protein inputs, please refer to [manipulating_molecules](./manipulating_molecules.ipynb).

Then pass the inputs to `pipeline.run()` to perform prediction. It accepts either a single input or multiple inputs. The return value is a tuple: the first element is a list of the original model outputs, and the second element is a list of metadata for building workflows (which you can simply ignore).

Here we provide examples on two tasks. You can also modify model inputs [here](../open_biomed/scripts/inference.py) and run `python open_biomed/scripts/inference.py --task [TASK_NAME]` to test any task you are interested in.


In [3]:
from open_biomed.core.pipeline import InferencePipeline
from open_biomed.data.molecule import Molecule

# Predict if a molecule can penetrate the blood-brain barrier (https://arxiv.org/abs/1703.00564) with a fine-tuned GraphMVP (https://arxiv.org/abs/2110.07728) model
pipeline = InferencePipeline(
    task="molecule_property_prediction",
    model="graphmvp",
    model_ckpt="./checkpoints/demo/graphmvp-BBBP.ckpt",
    additional_config="./configs/dataset/bbbp.yaml",
    device="cpu"
)
print(pipeline.print_usage())

# Construct molecules via SMILES strings
molecule1 = Molecule.from_smiles("Nc1[nH]c(C(=O)c2ccccc2)c(-c2ccccn2)c1C(=O)c1c[nH]c2ccc(Br)cc12")
molecule2 = Molecule.from_smiles("CN1CCC[C@H]1COC2=NC3=C(CCN(C3)C4=CC=CC5=C4C(=CC=C5)Cl)C(=N2)N6CCN([C@H](C6)CC#N)C(=O)C(=C)F")

# The tool can handle multiple inputs simultaneously
outputs = pipeline.run(
    molecule=[molecule1, molecule2]
)[0]
print(outputs)

Molecular property prediction.
Inputs: {"molecule": a small molecule}
Outputs: A float number in [0, 1] indicating the likeness of the molecule to exhibit certain properties.


Inference Steps: 100%|██████████| 1/1 [00:00<00:00, 174.02it/s]

[[0.582], [0.8478]]
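The output above is a nested list with one score per input molecule. The sketch below turns such scores into yes/no labels. It assumes a cutoff of 0.5 and that a higher score means the molecule is more likely to penetrate the blood-brain barrier; neither is stated by the pipeline itself, and `to_labels` is a hypothetical helper.

```python
from typing import List


def to_labels(scores: List[List[float]], threshold: float = 0.5) -> List[bool]:
    """Map each single-element score row to True when it reaches the threshold."""
    return [row[0] >= threshold for row in scores]


scores = [[0.582], [0.8478]]  # copied from the cell output above
labels = to_labels(scores)
print(labels)  # [True, True]
```

A different cutoff may be more appropriate depending on how the model was calibrated.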





In [4]:
from open_biomed.core.pipeline import InferencePipeline
from open_biomed.data.protein import Protein

# Predict the 3D structure of the protein based on its amino acid sequence using EsmFold (https://www.science.org/doi/10.1126/science.ade2574)
# REMARK: A GPU with at least 16 GB of memory is recommended to speed up inference. If you don't have an NVIDIA GPU, change the `device` argument to `cpu`.
pipeline = InferencePipeline(
    task="protein_folding",
    model="esmfold",
    model_ckpt="./checkpoints/demo/esmfold.ckpt",
    device="cuda:0"
)
print(pipeline.print_usage())

# Initialize a protein with an amino acid sequence
protein = Protein.from_fasta("MASDAAAEPSSGVTHPPRYVIGYALAPKKQQSFIQPSLVAQAASRGMDLVPVDASQPLAEQGPFHLLIHALYGDDWRAQLVAFAARHPAVPIVDPPHAIDRLHNRISMLQVVSELDHAADQDSTFGIPSQVVVYDAAALADFGLLAALRFPLIAKPLVADGTAKSHKMSLVYHREGLGKLRPPLVLQEFVNHGGVIFKVYVVGGHVTCVKRRSLPDVSPEDDASAQGSVSFSQVSNLPTERTAEEYYGEKSLEDAVVPPAAFINQIAGGLRRALGLQLFNFDMIRDVRAGDRYLVIDINYFPGYAKMPGYETVLTDFFWEMVHKDGVGNQQEEKGANHVVVK")
outputs = pipeline.run(
    protein=protein,
)
# The output is still a Protein object, now with its 3D backbone coordinates available
# You can open the saved PDB file or use our [visualization tools](./visualization.ipynb) to inspect the structure
outputs[0][0].save_pdb("./tmp/folded_protein.pdb")

Some weights of EsmForProteinFolding were not initialized from the model checkpoint at /AIRvePFS/dair/users/ailin/.cache/huggingface/hub/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Protein folding prediction.
Inputs: {"protein": a protein sequence}
Outputs: A protein object with 3D structure available.


Inference Steps: 100%|██████████| 1/1 [00:05<00:00,  5.26s/it]


'./tmp/folded_protein.pdb'
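A quick way to sanity-check the saved file is to count its alpha-carbon atoms, which one would expect to match the number of residues in the input sequence. The sketch below parses PDB text using the fixed column layout of `ATOM` records (atom name in columns 13 to 16). `count_ca_atoms` is a hypothetical helper, and the sample lines are synthetic, not taken from the folded protein.

```python
def count_ca_atoms(pdb_text: str) -> int:
    """Count ATOM records whose atom name (columns 13-16) is CA."""
    count = 0
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            count += 1
    return count


# Synthetic two-residue fragment used only to exercise the parser.
sample_pdb = "\n".join([
    "ATOM      1  N   MET A   1       0.000   0.000   0.000  1.00  0.00           N",
    "ATOM      2  CA  MET A   1       1.458   0.000   0.000  1.00  0.00           C",
    "ATOM      3  C   MET A   1       2.009   1.420   0.000  1.00  0.00           C",
    "ATOM      4  CA  ALA A   2       3.900   2.700   0.000  1.00  0.00           C",
])
n_ca = count_ca_atoms(sample_pdb)
print(n_ca)
```

To check the real output, read the saved file first, for example with `pathlib.Path("./tmp/folded_protein.pdb").read_text()`, and compare the count with the length of the input sequence.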