Authors: Vishal Dey, Xia Ning
Workshop paper: Accepted in BioKDD '23
Full version: In Progress
This repository provides the source code for the proposed methods:
Operating systems: Red Hat Enterprise Linux (RHEL) 7.7
Install packages under conda environments
conda create -n drugrank python=3.9
conda activate drugrank
conda install -y -c rdkit rdkit=2023.03.3
conda install -y numpy=1.26.0 scipy=1.9.1
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip3 install torch=1.13.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu116
- CCLE: Download CCLE expression data from here and save it as
data/CCLE/CCLE_expression.csv
. - Combined: Download the
combined_rnaseq_data
file with gene expression data for cell lines renamed for CTRPv2 from here and save it inCombined/
. - Please use the provided CTRPv2 (with adjusted AUCs) and PRISM datasets which are already processed. The processed datasets can be downloaded from here. Unzip the
ctrpv2.zip
andprism.zip
inside thedata
directory. - Find detailed instructions to download and process data in
data/README.md
.
- Check
scripts/run_ae.sh
on how to run the pretraining code insrc/train_ae.py
. - Each pre-training run will take only a few minutes on a single V100 GPU.
Run the below code to train
export DATA_FOLDER="data/ctrpv2/"
python src/cross_validate.py --model listone --data_path $DATA_FOLDER/LCO/aucs.txt --smiles_path $DATA_FOLDER/cmpd_smiles.txt --splits_path $DATA_FOLDER/LCO/pletorg/ --pretrained_ae -ae_path ${ae_path} -fgen morgan_count --setup LCO
Run the below code to train
export DATA_FOLDER="data/ctrpv2/"
python src/cross_validate.py --model listall --data_path $DATA_FOLDER/LCO/aucs.txt --smiles_path $DATA_FOLDER/cmpd_smiles.txt --splits_path $DATA_FOLDER/LCO/pletorg/ --pretrained_ae -ae_path ${ae_path} -fgen morgan_count -M 0.5 --setup LCO
Run the below code to train
export DATA_FOLDER="data/ctrpv2/"
python src/cross_validate.py --model pairpushc --data_path $DATA_FOLDER/LCO/aucs.txt --smiles_path $DATA_FOLDER/cmpd_smiles.txt --splits_path $DATA_FOLDER/LCO/pletorg/ --pretrained_ae -ae_path ${ae_path} -classc -fgen morgan_count --setup LCO
where ${ae_path} should be the path to the directory containing the saved models.
-
model
specifies the type of model to train. -
data_path
specifies the file path containing final processed list of cell ID, drug ID and AUC values (comma-separated). -
smiles_path
specifies the file path containing the list of tab-separated drug ID and its SMILES string, the SMILES string must be the last column in this file. -
splits_path
specifies the path to the directory containing the folds, where each fold is saved as a directory. -
ae_path
specifies the path to the directory containing the pretrained$\mathtt{GeneAE}$ model. - check
utils/args.py
for other hyper-parameters. - Use
export DATA_FOLDER="data/ctrpv2/"
and changeae_ind
to 17743 andae_path
todata/Combined/combined_rnaseq_data
for all experiments on CTRP dataset. - Use
export DATA_FOLDER="data/prism/"
and changeae_ind
to 19177 andae_path
todata/CCLE/CCLE_expression.csv
for all experiments on PRISM dataset. - change the
splits_path
to$DATA_FOLDER/LRO/
,data_path
to$DATA_FOLDER/LRO/aucs.txt
andsetup=LRO
for the LRO experiments.
Check the following scripts for hyper-parameter grid-search and cross-validation:
-
scripts/run_listone.sh
for$\mathtt{List\text{-}One}$ . -
scripts/run_listall.sh
for$\mathtt{List\text{-}All}$ . -
scripts/run_pairpushc.sh
for$\mathtt{Pair\text{-}PushC}$ .