# EvoDiff Examples

In this notebook we give an overview of the following topics:

* Installation 
* Unconditional generation 
* Evolutionary guided generation 
* Inpainting of intrinsically disordered regions 
* Scaffolding functional motifs 

## Installation

To download and run our code, first open this notebook in a clean conda environment. We recommend creating it with Python `v3.8.5`, for example by running `conda create --name evodiff python=3.8.5`. In that new environment, install our code by running:

In [None]:
import sys
!{sys.executable} -m pip install evodiff

You will also need to install PyTorch. We tested our models on `v2.0.1`. Change the line below to install the PyTorch version that works for your system.

In [None]:
conda install pytorch torchvision torchaudio cpuonly -c pytorch

You also need PyTorch Geometric and PyTorch Scatter installed:

In [None]:
conda install pyg -c pyg

In [None]:
conda install -c conda-forge torch-scatter
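After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the distribution names listed (`torch-geometric`, `torch-scatter`) are assumptions about how the packages are registered and may differ depending on whether they were installed through pip or conda.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names below are assumptions; adjust them to match your environment.
for name in ("evodiff", "torch", "torch-geometric", "torch-scatter"):
    print(name, installed_version(name) or "not found")
```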

## Unconditional sequence generation

### Generate a sequence with EvoDiff-Seq-OADM 38M

First, download the model checkpoint from Zenodo. For demonstration purposes, we use the smaller 38M model and generate on a CPU. If you are interested in using EvoDiff-Seq-OADM 640M, make sure you have ~7GB available to store the model checkpoint. If you have a GPU available, change the `device` inputs accordingly.
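One way to avoid hard-coding the device is to pick it at runtime. This is a small sketch, not part of EvoDiff: `pick_device` is a hypothetical helper that falls back to `'cpu'` when PyTorch is missing or no CUDA device is visible. The resulting string could then be passed wherever the cells in this notebook use `device='cpu'`.

```python
def pick_device():
    """Return 'cuda' if PyTorch can see a CUDA GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet; default to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```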

Everything needed to run unconditional generation is saved in the checkpoint.

In [5]:
from evodiff.pretrained import OA_DM_38M

checkpoint = OA_DM_38M()
model, collater, tokenizer, scheme = checkpoint

To generate one sequence, run the cell below. The only thing you need to define is the desired sequence length, via the `seq_len` input.

In [6]:
from evodiff.generate import generate_oaardm

seq_len = 100
tokenized_sample, generated_sequence = generate_oaardm(model, tokenizer, seq_len, batch_size=1, device='cpu')
print("Generated sequence:", generated_sequence)

100%|██████████| 100/100 [00:10<00:00,  9.80it/s]

Generated sequence: ['MLIENPSLETVCSLKSPYKLYDFELQEIRESWEYTWQVNSEEDKFKSIISGFLRFAEFYQKLVKVSADEVYKIPGELVTNFKLMWKLQAKLSKAKYEVER']
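Generated sequences come back as a list of strings, so saving them for downstream tools is straightforward. Below is a minimal sketch of a FASTA formatter. `to_fasta` and the record prefix are hypothetical names introduced here, and the short sequences are placeholders rather than model output.

```python
def to_fasta(sequences, prefix="evodiff_sample", width=60):
    """Format a list of sequence strings as FASTA text, wrapping lines at `width`."""
    lines = []
    for i, seq in enumerate(sequences):
        lines.append(f">{prefix}_{i}")
        lines.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder sequences; in the notebook you would pass `generated_sequence`.
example = ["MKTAYIAKQR", "GSHMLE"]
fasta_text = to_fasta(example, width=5)
print(fasta_text)
```

The returned text can be written to disk with a regular `open(path, 'w')` call.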





### Generate a sequence with EvoDiff-D3PM-Uniform 38M

Again, we show an example here using the smaller model weights. For D3PM models we need additional inputs for inference, so we download checkpoints with `return_all=True`. If you are using a BLOSUM model, make sure to download the BLOSUM matrix file in `data/` to your local files.

In [7]:
from evodiff.pretrained import D3PM_UNIFORM_38M

checkpoint = D3PM_UNIFORM_38M(return_all=True)
model, collater, tokenizer, scheme, timestep, Q_bar, Q = checkpoint

sohl-dickstein


Downloading: "https://zenodo.org/record/8045076/files/d3pm-uniform-38M.tar?download=1" to /Users/nityathakkar/.cache/torch/hub/checkpoints/d3pm-uniform-38M.tar
100%|██████████| 434M/434M [05:46<00:00, 1.31MB/s]  
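To give some intuition for the extra `Q` and `Q_bar` inputs, the sketch below builds per-step and cumulative transition matrices for a uniform discrete diffusion process in its textbook form: at each step a token keeps its identity with probability `1 - beta` and is otherwise resampled uniformly. This is an illustration under that assumption, with a toy alphabet of size 4 and a made-up noise schedule; the matrices stored in the checkpoint may be constructed differently.

```python
def uniform_transition(beta, k):
    """One-step matrix: stay with prob. 1 - beta, else resample uniformly over k tokens."""
    off = beta / k
    return [[(1.0 - beta) + off if i == j else off for j in range(k)] for i in range(k)]


def matmul(a, b):
    """Plain-Python matrix product for small square matrices."""
    return [[sum(a[i][x] * b[x][j] for x in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transition_matrices(betas, k):
    """Return per-step matrices Q_t and cumulative products Q_bar_t = Q_1 ... Q_t."""
    q_list, q_bar_list, q_bar = [], [], None
    for beta in betas:
        q = uniform_transition(beta, k)
        q_bar = q if q_bar is None else matmul(q_bar, q)
        q_list.append(q)
        q_bar_list.append(q_bar)
    return q_list, q_bar_list


# Toy schedule and alphabet size, chosen only for illustration.
q_list, q_bar_list = transition_matrices([0.1, 0.2, 0.3], k=4)
print([round(v, 3) for v in q_bar_list[-1][0]])
```

Each row of every matrix sums to 1, and the diagonal of the cumulative matrix shrinks toward `1/k` as more noise steps are applied.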


We can then generate one sequence as follows, where again only `seq_len` needs to be defined:

In [8]:
from evodiff.generate import generate_d3pm

seq_len = 100 

tokenized_sample, generated_sequence = generate_d3pm(model, tokenizer, Q, Q_bar, timestep, seq_len, batch_size=1, device='cpu')

100%|██████████| 499/499 [00:57<00:00,  8.73it/s]

final seq ['MLCIRDVAHRVLHKRPAAPIQITAASAAVSLSDSHTAIAASASDAAAVFDSEDRRNRERGEAASGENTMLTVSVASIKQFSLAVGGEDISPAPESGSAPV']





## Conditional generation


### Evolutionary guided sequence generation with EvoDiff-MSA 

To generate a sequence given a multiple sequence alignment, you must have an MSA available. Our `generate-msa.py` code samples the validation dataset of OpenFold, then subsamples an MSA to `n_sequences` x `seq_length`, and generates a new query sequence for that sampled MSA.

To run the following code on a custom MSA, you must provide the path to an MSA saved as an A3M file and specify the subsampling of `n_sequences` by `seq_length` via the scheme of your choice (`selection_type=random` or `MaxHamming`), where the query sequence is the sequence you want to generate. We have not extensively tested our subsampling code outside of the OpenFold dataset.


*Note: All our conditional generation uses OADM models; we do not currently support conditional generation with D3PM.*

First, let's download the appropriate weights for EvoDiff-OADM-MSA.

In [18]:
from evodiff.pretrained import MSA_OA_DM_MAXSUB

checkpoint = MSA_OA_DM_MAXSUB()
model, collater, tokenizer, scheme = checkpoint

Next, we provide the path to an A3M file in `path_to_msa` and subsample the MSA to `n_sequences` by `seq_length` using `random` subsampling before we begin the conditional generation task. If the MSA is shorter than the provided `seq_length`, additional rows are padded with `PAD_TOKEN=!`. Here, the input file contains many FASTA sequences, and the query sequence is the first entry in the A3M file. The subsampled MSA is returned with the query sequence in the first row. We will mask out this sequence and generate a new one in its place.

We have provided an example A3M file under `examples/a3m_example`, which we will use here.

In [19]:
from evodiff.generate_msa import generate_query_oadm_msa_simple
import re

path_to_msa = 'example_files/bfd_uniclust_hits.a3m'
n_sequences=64 # number of sequences in MSA to subsample
seq_length=256 # maximum sequence length to subsample
selection_type='random' # or 'MaxHamming'; MSA subsampling scheme


tokenized_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)

# Strip gap ('-') and pad ('!') characters from the generated query sequence
print("New sequence (no gaps, pad tokens)", re.sub('[!-]', '', generated_sequence[0][0]))

100%|██████████| 164/164 [1:03:03<00:00, 23.07s/it]

New sequence (no gaps, pad tokens) MDLRSSLVEHEGLRWKVYNNAEYVPTIGLGQIHNRPSQYWDYPVPLPEQYAEKDQISWSLETIQAVFDERYTKAKSEMVNLETIGKNFDDLPSEHTNAVTDMMFQLGTDHLSEFHKMITALKNNTYEEACREMKSSFWTRQMGNRCTRYLNDALEENYFFFNHH
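The random subsampling step described above can be pictured with a small standalone sketch. It is not EvoDiff's implementation: `read_a3m_text` and `subsample_random` are hypothetical helpers, the crop simply keeps the first `seq_length` columns (an assumption), and padding and the `MaxHamming` scheme are left out. It relies on the A3M convention that lowercase letters mark insertions relative to the query, so dropping them leaves rows of equal length.

```python
import random
import string


def read_a3m_text(text):
    """Parse A3M text into aligned rows, dropping lowercase insertion characters."""
    drop_lower = str.maketrans("", "", string.ascii_lowercase)
    rows, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                rows.append("".join(current).translate(drop_lower))
                current = []
        else:
            current.append(line)
    if current:
        rows.append("".join(current).translate(drop_lower))
    return rows


def subsample_random(rows, n_sequences, seq_length, seed=0):
    """Keep the query (first row) plus randomly chosen rows, cropped to seq_length columns."""
    rng = random.Random(seed)
    query, others = rows[0], rows[1:]
    picked = rng.sample(others, min(n_sequences - 1, len(others)))
    return [row[:seq_length] for row in [query] + picked]


# Toy alignment used only for illustration.
toy_a3m = ">query\nMKTa\n>hit1\nM-T\n>hit2\nMRT\n>hit3\nMKS\n"
rows = read_a3m_text(toy_a3m)
sub = subsample_random(rows, n_sequences=3, seq_length=2)
print(sub)
```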





### Inpainting IDRs with EvoDiff-Seq

First, let's load the model we want to use.

In [11]:
from evodiff.pretrained import OA_DM_38M

checkpoint = OA_DM_38M()
model, collater, tokenizer, scheme = checkpoint

Using an example input `sequence`, we will show you how to inpaint a new region (from `start_idx` to `end_idx`) of that sequence using EvoDiff-Seq.

In [12]:
from evodiff.conditional_generation import inpaint_simple 

sequence = 'DQTERTVRSFEGRRTAPYLDSRNVLTIGYGHLLNRPGANKSWEGRLTSALPREFKQRLTELAASQLHETDVRLATARAQALYGSGAYFESVPVSLNDLWFDSVFNLGERKLLNWSGLRTKLESRDWGAAAKDLGRHTFGREPVSRRMAESMRMRRGIDLNHYNI'
start_idx = 20
end_idx = 50


sample, entire_sequence, generated_idr = inpaint_simple(model, sequence, start_idx, end_idx, tokenizer=tokenizer, device='cpu')

print("original sequence:", sequence)
print("generated sequence", entire_sequence)


print("\noriginal region:", sequence[start_idx:end_idx])
print("generated region:", generated_idr)

100%|██████████| 30/30 [00:03<00:00,  8.00it/s]

original sequence: DQTERTVRSFEGRRTAPYLDSRNVLTIGYGHLLNRPGANKSWEGRLTSALPREFKQRLTELAASQLHETDVRLATARAQALYGSGAYFESVPVSLNDLWFDSVFNLGERKLLNWSGLRTKLESRDWGAAAKDLGRHTFGREPVSRRMAESMRMRRGIDLNHYNI
generated sequence DQTERTVRSFEGRRTAPYLDTMVAVGQGENPGLMKPMSESADELLRQPPPPREFKQRLTELAASQLHETDVRLATARAQALYGSGAYFESVPVSLNDLWFDSVFNLGERKLLNWSGLRTKLESRDWGAAAKDLGRHTFGREPVSRRMAESMRMRRGIDLNHYNI

original region: SRNVLTIGYGHLLNRPGANKSWEGRLTSAL
generated region: TMVAVGQGENPGLMKPMSESADELLRQPPP
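A quick sanity check after inpainting is to confirm that only the requested region changed. The sketch below assumes the same half-open slicing used in the cell above (`sequence[start_idx:end_idx]`). `check_inpaint` is a hypothetical helper, and the short strings are placeholders rather than model output.

```python
def check_inpaint(original, generated, start_idx, end_idx):
    """Return (flanks_unchanged, n_changed) for an inpainted region [start_idx, end_idx)."""
    same_length = len(original) == len(generated)
    flanks_ok = (original[:start_idx] == generated[:start_idx]
                 and original[end_idx:] == generated[end_idx:])
    # Count positions inside the region where the residue differs.
    n_changed = sum(a != b for a, b in zip(original[start_idx:end_idx],
                                           generated[start_idx:end_idx]))
    return same_length and flanks_ok, n_changed


# Placeholder strings; in the notebook you would pass `sequence` and `entire_sequence`.
ok, n_changed = check_inpaint("AAAAKKKKAAAA", "AAAAKRGKAAAA", 4, 8)
print("flanks unchanged:", ok, "| residues changed in region:", n_changed)
```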





### Inpainting IDRs with EvoDiff-MSA

Inpainting with EvoDiff-MSA follows the same approach as conditional generation of a query sequence given an MSA. Take particular care when subsampling MSAs for this task, and make sure that the subsampled indices correspond to the regions you are interested in inpainting.

### Scaffolding functional motifs with EvoDiff-Seq

Below we provide an example of scaffolding the PDB-ID: 1PRW. 

EvoDiff-Seq will search for a PDB file, given the 4-character PDB code, and attempt to download it if it cannot find it in the local directory. Raw PDB files can be missing residues or have extra residues, so use extreme caution when running this code to ensure that you are scaffolding the correct indices.

First, provide the start (`start_idx`) and end (`end_idx`) indices of the motif you are interested in scaffolding (the end index is inclusive). If there are multiple domains, make sure they are listed in ascending numerical order. For the example below, `start_idx=[51,15]`, `end_idx=[70,34]` is not acceptable. For multiple domains, we retain the original spacing between motifs: the entire domain from `start_idx[0]` to `end_idx[-1]` is extracted, and the non-motif regions are filled with a `MASK` token.
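The indexing rules above can be sketched as a small standalone function. This is an illustration rather than EvoDiff's code: `motif_template` is a hypothetical helper, indices are treated as 0-based with inclusive ends, and `#` stands in for the `MASK` token purely for display.

```python
def motif_template(sequence, start_idx, end_idx, mask="#"):
    """Extract start_idx[0]..end_idx[-1] (inclusive), masking residues outside the motifs."""
    if len(start_idx) != len(end_idx):
        raise ValueError("start_idx and end_idx must have the same length")
    if sorted(start_idx) != list(start_idx) or sorted(end_idx) != list(end_idx):
        raise ValueError("motif indices must be given in ascending order")
    keep = set()
    for s, e in zip(start_idx, end_idx):
        keep.update(range(s, e + 1))  # end index is inclusive
    return "".join(sequence[i] if i in keep else mask
                   for i in range(start_idx[0], end_idx[-1] + 1))


# Toy sequence: motifs at positions 1-2 and 6-8, with three masked positions between them.
template = motif_template("ABCDEFGHIJ", [1, 6], [2, 8])
print(template)  # BC###GHI
```

Passing the indices out of order, for example `[6, 1]` and `[8, 2]`, raises a `ValueError` in this sketch.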

This code indexes a data folder, `scaffolding-pdbs`. Ensure that folder is in the working path.

In [13]:
pdb_code = '1prw'

start_idx = [15, 51]
end_idx = [34, 70]

num_seqs = 1

data_top_dir = './' # Change this filepath to point to where scaffolding-msas and scaffolding-pdbs exist; they should be in the same folder as this notebook

Next, we can specify what `scaffold_length` we want to generate. The code will randomly sample the location of the motif within the specified scaffold length. 

In [14]:
from evodiff.conditional_generation import generate_scaffold

scaffold_length = 75 

generated_sequence, new_start_idx, new_end_idx = generate_scaffold(model, pdb_code, start_idx, end_idx, scaffold_length, data_top_dir, tokenizer, device='cpu')

print("motif start indices", new_start_idx)
print("motif end indices", new_end_idx)

ALREADY DOWNLOADED
CLEANING PDB
sequence extracted from pdb ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGELTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK
sequence length 147
motif extracted from indexes supplied: FSLFDKDGDGTITTKELGTV################INEVDADGNGTIDFPEFLTM
Generated sequence: ['FKEAFSLFDKDGDGTITTKELGTVMRSLGQNPLEAELQDMINEVDADGNGTIDFPEFLTMMAHKMKDTDSKEEIREAFRVFDRDGNGTILKEELRRTFREMRDSLSEFEQIDKDKDGAIWIEEYQSTPKKS']
motif start indices [4, 40]
motif end indices [23, 59]
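The new indices reported above are consistent with shifting the whole motif block by a single offset inside the scaffold. The sketch below illustrates that bookkeeping with a hypothetical `place_motif` helper. Drawing the offset uniformly at random is an assumption about how the placement is sampled, not a description of the library's internals.

```python
import random


def place_motif(start_idx, end_idx, scaffold_length, offset=None, seed=None):
    """Shift motif indices so the block start_idx[0]..end_idx[-1] starts at `offset`."""
    span = end_idx[-1] - start_idx[0] + 1  # inclusive end index
    if span > scaffold_length:
        raise ValueError("motif block is longer than the requested scaffold")
    if offset is None:
        # Assumed behaviour: pick any placement that keeps the block inside the scaffold.
        offset = random.Random(seed).randint(0, scaffold_length - span)
    if not 0 <= offset <= scaffold_length - span:
        raise ValueError("offset places the motif outside the scaffold")
    shift = offset - start_idx[0]
    return [s + shift for s in start_idx], [e + shift for e in end_idx]


# With an offset of 4, the indices match the output printed above.
new_start, new_end = place_motif([15, 51], [34, 70], scaffold_length=75, offset=4)
print(new_start, new_end)  # [4, 40] [23, 59]
```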


### Scaffolding functional motifs with EvoDiff-MSA

EvoDiff-MSA requires an A3M-formatted MSA in the data folder to proceed. For simplicity, we did not wrap any homology tools for automatic MSA generation. For generation, you must create an A3M, subsample an alignment (preserving the correct indices), and use this to generate a new query sequence.

If you would like to analyze the generated structure by comparing it to the original using the RMSD score, see the `analysis/rmsd_analysis.py` script.

#### We provide PDB files in the `examples/scaffolding-pdbs` folder. You can use the following code segment to visualize the various PDB files and pick one.

In [15]:
# Specify PDB code
pdb = '1prw' 

In [16]:
!{sys.executable} -m pip install py3Dmol

import py3Dmol

view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js')
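# Read the PDB file with a context manager so the file handle is closed afterwards
with open('scaffolding-pdbs/' + pdb + '.pdb', 'r') as pdb_file:
    view.addModel(pdb_file.read(), 'pdb')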

view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':0.5,'max':0.9}}}) # as color is set to lDDT
# view.setStyle({'cartoon': {'color':'spectrum'}})

view.zoomTo()
view.show()

Collecting py3Dmol
  Obtaining dependency information for py3Dmol from https://files.pythonhosted.org/packages/47/69/b295c4c0f7c9e9ddbb3f94577c0b15ddedb4dbbf08a451bdac5d0f5d4831/py3Dmol-2.0.3-py2.py3-none-any.whl.metadata
  Using cached py3Dmol-2.0.3-py2.py3-none-any.whl.metadata (2.1 kB)
Using cached py3Dmol-2.0.3-py2.py3-none-any.whl (12 kB)
Installing collected packages: py3Dmol
Successfully installed py3Dmol-2.0.3
