Model Card for EvoDiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models.

In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.

Developed by: Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini, Kevin K. Yang
Shared by: Microsoft Research New England
Model type: Diffusion-based protein sequence generation
License: MIT License

Model Sources

Repository: https://github.com/microsoft/evodiff
Preprint: https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1

Uses

Direct Use

This model is intended for research use. It can be used directly to generate protein sequences and alignments. We provide checkpoints for all our models so users can run our unconditional and conditional generation scripts.

We provide a notebook with installation guidance in examples/evodiff.ipynb. It also includes examples of how to generate a small number of sequences and MSAs using our models. We recommend following this notebook if you would like to use our models to generate proteins.

To load a model:

from evodiff.pretrained import OA_DM_38M

model, collater, tokenizer, scheme = OA_DM_38M()

Available models are:

D3PM_BLOSUM_640M()
D3PM_BLOSUM_38M()
D3PM_UNIFORM_640M()
D3PM_UNIFORM_38M()
OA_DM_640M()
OA_DM_38M()
LR_AR_640M()
LR_AR_38M()
MSA_D3PM_BLOSUM_RANDSUB()
MSA_D3PM_BLOSUM_MAXSUB()
MSA_D3PM_UNIFORM_RANDSUB()
MSA_D3PM_UNIFORM_MAXSUB()
MSA_OA_DM_RANDSUB()
MSA_OA_DM_MAXSUB()

Note: if you want to download a BLOSUM model, you will first need to download data/blosum62-special-MSA.mat.
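
Because the loaders listed above follow the same calling pattern as OA_DM_38M, a checkpoint can be selected by name. The sketch below is an illustration and not part of the EvoDiff package: `load_evodiff` and `needs_blosum_matrix` are hypothetical helper names, and it assumes every loader in the list is exposed in `evodiff.pretrained`, as the import shown earlier suggests.

```python
import importlib

# Loader names copied from the list above.
SEQUENCE_MODELS = (
    "D3PM_BLOSUM_640M", "D3PM_BLOSUM_38M",
    "D3PM_UNIFORM_640M", "D3PM_UNIFORM_38M",
    "OA_DM_640M", "OA_DM_38M",
    "LR_AR_640M", "LR_AR_38M",
)
MSA_MODELS = (
    "MSA_D3PM_BLOSUM_RANDSUB", "MSA_D3PM_BLOSUM_MAXSUB",
    "MSA_D3PM_UNIFORM_RANDSUB", "MSA_D3PM_UNIFORM_MAXSUB",
    "MSA_OA_DM_RANDSUB", "MSA_OA_DM_MAXSUB",
)
AVAILABLE_MODELS = SEQUENCE_MODELS + MSA_MODELS


def needs_blosum_matrix(name):
    """True for loaders whose name marks them as BLOSUM models."""
    return "BLOSUM" in name


def load_evodiff(name):
    """Hypothetical helper: call the loader called `name` in evodiff.pretrained."""
    if name not in AVAILABLE_MODELS:
        raise ValueError(f"unknown EvoDiff model: {name!r}")
    # Imported lazily so the name check works without evodiff installed.
    pretrained = importlib.import_module("evodiff.pretrained")
    return getattr(pretrained, name)()
```

With EvoDiff installed, `model, collater, tokenizer, scheme = load_evodiff("OA_DM_38M")` would mirror the direct call shown earlier, and `needs_blosum_matrix(name)` can be checked first to decide whether data/blosum62-special-MSA.mat is required.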

Please view our README.md for detailed instructions on how to generate sequences and multiple sequence alignments (MSAs) both unconditionally and conditionally.

Out-of-Scope Use

This model is intended for use on protein sequences. It is not meant for other biological sequences, such as DNA sequences, or natural language.
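
One simple guard against out-of-scope input is to screen strings before conditioning a model on them. The sketch below is a heuristic only: it assumes the 20 canonical one-letter amino-acid codes, whereas EvoDiff's own tokenizer may accept further symbols (ambiguous residues, gaps, masks). `classify_sequence` and both alphabets are illustrative names, not package API.

```python
# Assumption: the 20 canonical amino-acid one-letter codes.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
# Letters commonly seen in DNA/RNA strings; note they overlap with amino acids.
NUCLEOTIDE_LETTERS = frozenset("ACGTUN")


def classify_sequence(seq):
    """Roughly classify a string as 'protein', 'nucleotide-like', 'other', or 'empty'."""
    s = seq.strip().upper()
    if not s:
        return "empty"
    letters = set(s)
    # Checked first because A, C, G, T and N are also valid amino-acid letters.
    if letters <= NUCLEOTIDE_LETTERS:
        return "nucleotide-like"
    if letters <= CANONICAL_AMINO_ACIDS:
        return "protein"
    return "other"
```

Because the alphabets overlap, a short peptide made only of A, C, G, T, and N would be flagged as nucleotide-like, so the result is best treated as a warning rather than a hard filter.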

Bias, Risks, and Limitations

This model is not expected to perform well on anything other than proteins. This includes other biological sequences, such as DNA, and natural language. The model performs best on data within its training distribution: protein sequences and multiple sequence alignments (MSAs).

How to Get Started with the Model

To install our code, we recommend creating a clean conda environment with Python 3.8.5.

conda create --name evodiff python=3.8.5

In that new environment, install EvoDiff, either as the released package or from the current main branch of the repository:

pip install evodiff
pip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch

You will also need to install PyTorch (we tested our models on v2.0.1), PyTorch Geometric, and PyTorch Scatter.
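
A quick way to confirm the environment is complete is to test whether each package can be found before running any scripts. This is a minimal sketch: `missing_modules` is a hypothetical helper, and the import names `torch_geometric` and `torch_scatter` are assumed to correspond to PyTorch Geometric and PyTorch Scatter.

```python
import importlib.util

# Import names assumed for the packages named above.
REQUIRED_MODULES = ("evodiff", "torch", "torch_geometric", "torch_scatter")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` in the new conda environment should return an empty list once everything is installed. It only checks that the packages are present, not that their versions are compatible.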

Our downstream analysis scripts make use of a variety of tools we do not include in our package installation. To run the scripts, please download the following packages in addition to EvoDiff:

TM score
Omegafold
ProteinMPNN
ESM-IF1; see this Jupyter notebook for setup details.
PGP
DISOPRED3
DR-BERT

We refer to the setup instructions outlined by the authors of those tools.

Training Details

Training Data

We obtain sequences from the UniRef50 dataset, which contains approximately 42 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the OpenFold dataset, which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. The intrinsically disordered regions (IDR) data was obtained from the Reverse Homology GitHub.

For the scaffolding structural motifs task, we provide the pdb and fasta files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide the pdb files used for conditionally generating MSAs in the examples/scaffolding-msas folder.

Evaluation

Testing Data

To access the UniRef50 test sequences, use the following code:

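from sequence_models.datasets import UniRefDataset # assumption: the class is provided by the sequence_models package

test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences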

We provide all generated sequences on the EvoDiff Zenodo.

To download our unconditionally generated sequences (the unconditional_generations.csv file):

curl -L -o unconditional_generations.csv "https://zenodo.org/record/8329165/files/unconditional_generations.csv?download=1"

To extract all unconditionally generated sequences created using the EvoDiff-seq oa_dm_640M model, run the following code:

import pandas as pd
df = pd.read_csv('unconditional_generations.csv', index_col = 0)
subset = df.loc[df['model'] == 'evodiff_oa_dm_640M']

Please view our README.md for more information about the CSV files containing generated data.
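
Once a subset has been selected, it is often convenient to write it out as FASTA for downstream tools. The sketch below uses only the standard library and a toy in-memory table. The `model` column comes from the example above, but the `sequence` column name, the made-up rows, and the helper names `rows_for_model` and `to_fasta` are assumptions for illustration, so check the real CSV header before adapting it.

```python
import csv
import io


def rows_for_model(csv_text, model_name, model_col="model"):
    """Return the CSV rows whose model column equals model_name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[model_col] == model_name]


def to_fasta(rows, seq_col="sequence", prefix="gen"):
    """Format rows as FASTA text with sequential record names."""
    return "".join(f">{prefix}_{i}\n{row[seq_col]}\n" for i, row in enumerate(rows))


# Toy stand-in for unconditional_generations.csv; these rows are made up.
toy_csv = (
    ",model,sequence\n"
    "0,evodiff_oa_dm_640M,MKTAYIAK\n"
    "1,other_model,MSTNPKPQ\n"
    "2,evodiff_oa_dm_640M,MAGWV\n"
)
subset = rows_for_model(toy_csv, "evodiff_oa_dm_640M")
fasta = to_fasta(subset)
```

For the real file, the same filter could be applied to the text read from unconditional_generations.csv, or `to_fasta` could be given `subset.to_dict("records")` from the pandas example.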

Metrics

To analyze the quality of the generations, we look at:

amino acid KL divergence (aa_reconstruction_parity_plot)
secondary structure KL divergence (evodiff/analysis/calc_kl_ss.py)
model perplexity for sequences (evodiff/analysis/sequence_perp.py)
model perplexity for MSAs (evodiff/analysis/msa_perp.py)
Fréchet inception distance (evodiff/analysis/calc_fid.py)
Hamming distance (evodiff/analysis/calc_nearestseq_hamming.py)
RMSD score (analysis/rmsd_analysis.py)
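
To make two of these metrics concrete, the sketch below computes an amino-acid KL divergence between two sets of sequences and a nearest-sequence Hamming distance. It is not the code in the scripts listed above. The direction of the KL divergence, the pseudocount, and the restriction of Hamming comparisons to equal-length sequences are simplifying assumptions, and all function names are illustrative.

```python
import math
from collections import Counter

# Assumption: frequencies are taken over the 20 canonical amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(seqs, pseudocount=1e-6):
    """Amino-acid frequencies over a set of sequences, smoothed to avoid zeros."""
    counts = Counter(ch for s in seqs for ch in s if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def nearest_hamming(query, references):
    """Smallest Hamming distance to any equal-length reference, or None."""
    same_len = (r for r in references if len(r) == len(query))
    return min((hamming(query, r) for r in same_len), default=None)
```

For example, `kl_divergence(aa_frequencies(generated), aa_frequencies(test))` is close to zero when generated sequences match the test-set amino-acid composition, and a small `nearest_hamming` value flags a generation that is nearly identical to a reference sequence.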

We also compute the self-consistency perplexity to evaluate the foldability of generated sequences. To do so, we make use of various tools:

TM score
Omegafold
ProteinMPNN
ESM-IF1; see this Jupyter notebook for setup details.
PGP
DISOPRED3
DR-BERT

We refer to the setup instructions outlined by the authors of those tools.

Our analysis scripts for iterating over these tools are in the evodiff/analysis/downstream_bash_scripts folder. Once we run the scripts in this folder, we analyze the results in self_consistency_analysis.py.

Summary

We present EvoDiff, a diffusion modeling framework capable of generating high-fidelity, diverse, and novel proteins with the option of conditioning according to sequence constraints. Because it operates in the universal protein design space, EvoDiff can unconditionally sample diverse structurally-plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design.

Environmental Impact

Hardware Type: 32GB NVIDIA V100 GPUs
Hours used: 4,128 (14 days per sequence model, 10 days per MSA model)
Cloud Provider: Azure
Compute Region: East US
Carbon Emitted: 485.21 kg
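
The reported hours can be cross-checked with a little arithmetic. The sketch below assumes the total covers the eight sequence loaders and six MSA loaders listed under Direct Use, each trained for the stated number of days. The source does not say whether the figure is wall-clock time or GPU-hours, so the match is an observation rather than a confirmed breakdown.

```python
HOURS_PER_DAY = 24

# Assumptions: model counts are taken from the loader list under Direct Use.
n_sequence_models = 8
n_msa_models = 6

sequence_hours = n_sequence_models * 14 * HOURS_PER_DAY  # 14 days per sequence model
msa_hours = n_msa_models * 10 * HOURS_PER_DAY            # 10 days per MSA model
total_hours = sequence_hours + msa_hours

# Average emissions per reported hour, from the figures above.
kg_co2_per_hour = 485.21 / total_hours

print(total_hours, round(kg_co2_per_hour, 3))
```

Under these assumptions the total is 2,688 + 1,440 = 4,128 hours, matching the reported figure, which works out to roughly 0.118 kg of carbon per reported hour.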
