CoPRA is a state-of-the-art predictor of protein-RNA binding affinity. CoPRA combines a protein language model and an RNA language model, taking the complex structure as input. The model is pre-trained on the PRI30k dataset via a bi-scope strategy and fine-tuned on PRA310. CoPRA can also be redirected to predict mutation effects, showing strong per-structure prediction performance on the mCSM-RNA dataset. Please see our paper for more details.
Please do not hesitate to contact us or create an issue/PR if you have any questions or suggestions!
Step 1. Clone this repository and set up the environment. We recommend installing the dependencies via the fast package manager mamba (you can also replace the command `mamba` with `conda`). Generally, CoPRA works with Python 3.10.14 and PyTorch 2.1.2.
```shell
git clone git@github.com:hanrthu/CoPRA.git
cd CoPRA
mamba env create -f environment.yml
```
Step 2. Install flash-attn and RiNALMo with the following commands. You may also need to download the RiNALMo-650M model and place it in the ./weights folder of this repo.
```shell
# Download the flash-attn 2.6.3 wheel file from https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
git clone git@github.com:lbcb-sci/RiNALMo.git
cd RiNALMo
pip install -e .
```
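After installation, a quick sanity check can confirm that the dependencies resolve in the active environment. This snippet is illustrative and not part of the CoPRA codebase:

```python
# Optional sanity check: confirm the freshly installed packages can be
# resolved by the current interpreter. Illustrative only; not part of CoPRA.
import importlib.util

def installed(name: str) -> bool:
    """Return True if a module can be found on the current import path."""
    return importlib.util.find_spec(name) is not None

for pkg in ("flash_attn", "rinalmo", "torch"):
    print(f"{pkg}: {'OK' if installed(pkg) else 'MISSING'}")
```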
Here we first provide our proposed datasets, including PRA310, PRA201, and PRI30k, together with an mCSM-RNA dataset. You can easily access them through 🤗Huggingface: /Jesse7/CoPRA_data. The only difference between PRA201 and PRA310 is the selected samples, so the PRA201 labels and splits are in PRA310/splits/PRA201.csv. Download these datasets and place them in the ./datasets folder.
The number of samples in each dataset is shown below (we use PRA as the abbreviation of protein-RNA binding affinity):
| Dataset | Type | Size |
|---|---|---|
| PRA310 | PRA | 310 |
| PRA201 | PRA (pair-only) | 201 |
| PRI30k | Unsupervised complexes | 30006 |
| mCSM-RNA | Mutation effect on PRA | 79 |
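As a minimal sketch of how a split file such as PRA310/splits/PRA201.csv might be consumed (the column name `fold` is an assumption about the CSV layout, not the repo's documented schema):

```python
# Illustrative loader for a cross-validation split CSV.
# The column name "fold" is an assumption about the file layout.
import csv
from collections import Counter

def fold_sizes(split_csv: str) -> Counter:
    """Count how many samples fall in each cross-validation fold."""
    with open(split_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    return Counter(row["fold"] for row in rows)
```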
We also provide five-fold model checkpoints, obtained by pretraining Co-Former with PRI30k and finetuning it on PRA310; they can be downloaded through 🤗Huggingface: /Jesse7/CoPRA. That repository also contains pretrained RiNALMo-650M weights. Download these weights and place them in the ./weights folder.
The 5-fold cross-validation performance on PRA310 reaches state-of-the-art; please see our paper for the full comparison.
Note 1: It is normal for the first training epoch on a new dataset to be relatively slow, because the caching procedure runs then.
Note 2: We also support LoRA tuning and all-parameter tuning. For LoRA tuning, specify `lora_tune: true` in ./config/models/copra.yml; for all-parameter tuning, specify `fix_lms: false` in the same file.
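For reference, the two switches in ./config/models/copra.yml would look like the fragment below (surrounding keys omitted; only the flag names come from this README):

```yaml
# ./config/models/copra.yml (fragment; other keys omitted)
lora_tune: true   # enable LoRA tuning
fix_lms: true     # set to false for all-parameter tuning
```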
```shell
# Evaluate on PRA310
python run.py test dG --model_config ./config/models/copra.yml --data_config ./config/datasets/PRA310.yml --run_config ./config/runs/test_basic.yml
```
```shell
# Finetune on PRA310 or PRA201
python run.py finetune dG --model_config ./config/models/copra.yml --data_config ./config/datasets/PRA310.yml --run_config ./config/runs/finetune_struct.yml
python run.py finetune dG --model_config ./config/models/copra.yml --data_config ./config/datasets/PRA201.yml --run_config ./config/runs/finetune_struct.yml
```
```shell
# Pretrain Co-Former
python run.py finetune pretune --model_config ./config/models/copra.yml --data_config ./config/datasets/biolip.yml --run_config ./config/runs/pretune_struct.yml
```
After pretraining, you can continue to finetune on a new dataset with the finetuning scripts above, specifying the ckpt of the pretrained model in config/runs/finetune_struct.yml.
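The checkpoint hookup might look like the following fragment of config/runs/finetune_struct.yml (the path is a placeholder; only the `ckpt` key name comes from this README):

```yaml
# config/runs/finetune_struct.yml (fragment)
ckpt: ./weights/pretrained_coformer.ckpt  # placeholder path to your pretrained checkpoint
```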
```shell
# Zero-shot mutation-effect (ddG) prediction on the blind test set
python run.py test ddG --model_config ./config/models/copra.yml --data_config ./config/datasets/blindtest.yml --run_config ./config/runs/zero_shot_blindtest.yml
```
If you find our repo useful, please consider citing:
```bibtex
@article{han2024copra,
  title={CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction},
  author={Han, Rong and Liu, Xiaohong and Pan, Tong and Xu, Jing and Wang, Xiaoyu and Lan, Wuyang and Li, Zhenyu and Wang, Zixuan and Song, Jiangning and Wang, Guangyu and others},
  journal={arXiv preprint arXiv:2409.03773},
  year={2024}
}
```

