Multi-Modal Large Language Model Enables Protein Function Prediction

This repository holds the code and data of Multi-Modal Large Language Model Enables Protein Function Prediction.

Update on Jan 5, 2025: change protein encoder from ESM to xtrimoPGLM

Examples

Examples of multi-round dialogues with ProteinChat for Q9U281, Q9XZG9, and Q9LU44.

Introduction

ProteinChat is a versatile, multi-modal large language model designed to predict protein functions from amino acid sequences.
ProteinChat works in a similar way as ChatGPT. Users upload a protein sequence and ask various questions about this protein. ProteinChat will answer these questions in a multi-turn, interactive manner.
The ProteinChat system consists of a protein encoder, a large language model (LLM), and an adaptor. The protein encoder takes a protein sequence as input and learns a representation for this protein. The adaptor transforms the protein representation produced by the protein encoder into another representation that is acceptable to the LLM. The LLM takes the representation transformed by the adaptor and users' questions about this protein as inputs and generates answers. All these components are trained end-to-end. We use xTrimoPGLM-1B as the protein encoder.
To train ProteinChat, we designed (protein, prompt, answer) triplets from the functions and keywords from Swiss-Prot dataset, resulting in ~500k proteins and 1.5 million triplets.

Getting Started

Installation

1. Prepare the code and the environment

Git clone our repository, creating a python environment and ativate it via the following command

git clone https://github.com/mignonjia/ProteinChat.git
cd ProteinChat
conda env create -f environment.yml
conda activate proteinchat

Verify the installation of torch and torchvision is successful by running python -c "import torchvision; print(torchvision.__version__)". If it outputs the version number without any warnings or errors, then you are good to go. If it outputs any warnings or errors, try to uninstall torch by conda uninstall pytorch torchvision torchaudio cudatoolkit and then reinstall them following here. You need to find the correct command according to the CUDA version your GPU driver supports (check nvidia-smi).

2. Prepare the dataset

The dataset contains 462,019 proteins (represented using 3D structures) with 1.5 million instructions. It is curated from the Swiss-Prot Dataset. The dataset data.tar.gz (148 MB) can be downloaded here. Copy it under this folder and run

tar -xvf data.tar.gz

You will obtain a data folder with three subfolders train_set, valid_set, and test_set.

3. Prepare the pretrained Vicuna weights

The current version of ProteinChat is built on Vicuna-13B-v1.5. Please download Vicuna weights from https://huggingface.co/lmsys/vicuna-13b-v1.5. Then, set the path to the vicuna weight in the config files configs/proteinchat_stage1.yaml and configs/proteinchat_stage2.yaml.

4. Prepare the xtrimoPGLM protein encoder

Download proteinglm-1b-mlm[https://huggingface.co/Bo1015/proteinglm-1b-mlm] to your local machine, and in your downloaded proteinglm folder, modify code Line715 - Line720 of modeling_proteinglm.py to the following:

        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        # Final layer norm.
        if self.post_layer_norm:
            hidden_states = self.final_layernorm(hidden_states)

Then in configs/proteinchat_eval.yaml, set glm_load_path to your local path of proteinglm. Also, download ProteinChat's trained weights from Google Drive and set its path to stage1_ckpt in configs/proteinchat_eval.yaml.

Training

You need at least 55 GB GPU memory for the training.

The stage-1 training configuration file is configs/proteinchat_stage1.yaml. In addition, you may want to change the number of epochs and other hyper-parameters there, such as max_epoch, init_lr, min_lr,warmup_steps, batch_size_train. Please adjust iters_per_epoch so that iters_per_epoch * batch_size_train = your training set size.

Also, set your desired output directory here.

Start stage-1 training by running

bash finetune.sh --cfg-path configs/proteinchat_stage1.yaml

The stage-2 training configuration file is configs/proteinchat_stage2.yaml. Replace the stage1_ckpt with the checkpoint you obtained in stage 1. Similar with the previous step, you also need to replace the output directory in this file.

Start stage-2 training by running

bash finetune.sh --cfg-path configs/proteinchat_stage2.yaml

Evaluation

It takes around 24 GB GPU memory for the inference.

Modify the checkpoint paths in configs/proteinchat_eval.yaml to the location of your checkpoint. Download our ProteinChat's stage1_ckpt here. peft_ckpt can be set empty during evaluation. To evaluate stage-2, this parameter needs to be set False.

Evaluate on 20 samples on free-form function prediction and 10 samples for each specific-category prediction by running

bash demo.sh

Acknowledgement

License

This repository is under BSD 3-Clause License.

Citation

If you're using ProteinChat in your research or applications, please cite using this BibTeX:

@article{huo2024multi,
  title={Multi-modal large language model enables protein function prediction},
  author={Huo, Mingjia and Guo, Han and Cheng, Xingyi and Singh, Digvijay and Rahmani, Hamidreza and Li, Shen and Gerlof, Philipp and Ideker, Trey and Grotjahn, Danielle A and Villa, Elizabeth and Song, Le and Xie, Pengtao},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
configs		configs
fig		fig
proteinchat		proteinchat
.gitignore		.gitignore
README.md		README.md
demo.sh		demo.sh
environment.yml		environment.yml
eval.py		eval.py
finetune.sh		finetune.sh
inference_all.py		inference_all.py
train_esm.py		train_esm.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-Modal Large Language Model Enables Protein Function Prediction

Examples

Introduction

Getting Started

Installation

Training

Evaluation

Acknowledgement

License

Citation

About

Releases

Packages

Languages

mignonjia/ProteinChat

Folders and files

Latest commit

History

Repository files navigation

Multi-Modal Large Language Model Enables Protein Function Prediction

Examples

Introduction

Getting Started

Installation

Training

Evaluation

Acknowledgement

License

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages