# Evo 2 

## In this notebook we use Evo 2 to demonstrate some of its capabilities

- Installing Evo 2 is slightly tricky, so we provide step-by-step guides in `helical/models/evo_2/README.md` for either a Docker image or a conda environment
- A few things to note before attempting to run Evo 2:
    - Evo 2 requires an NVIDIA GPU with compute capability ≥8.9
    - It must be run on an x86_64 system
    - We ran it on Ubuntu 22.04.5 LTS
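
The notes above can be turned into a quick pre-flight check. This is a minimal sketch, not part of the Helical package: the helper names `meets_requirements` and `detect_capability` are hypothetical, and the check relies only on `platform.machine()` and, when PyTorch is installed, `torch.cuda.get_device_capability()`.

```python
import platform

MIN_CAPABILITY = (8, 9)  # threshold taken from the notes above


def meets_requirements(capability, machine):
    """Return True if a (major, minor) compute capability and CPU architecture match the notes above."""
    if capability is None:
        return False
    return tuple(capability) >= MIN_CAPABILITY and machine == "x86_64"


def detect_capability():
    """Return the (major, minor) capability of the first CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    ok = meets_requirements(detect_capability(), platform.machine())
    print("Hardware looks compatible" if ok else "Hardware does not match the notes above")
```

The pure function `meets_requirements` is kept separate from detection so it can be exercised on machines without a GPU.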

### Once you have installed everything you can continue with the notebook!

**Imports and Installs**

In [None]:
!pip install evaluate

In [1]:
from helical.models.evo_2 import Evo2, Evo2Config
import subprocess
import torch
# import umap
# import umap.plot
import numpy as np
from sklearn.neighbors import NearestNeighbors
import pandas as pd


INFO:datasets:PyTorch version 2.6.0 available.


**Download a CodonBERT task dataset**

In [None]:
url = "https://raw.githubusercontent.com/Sanofi-Public/CodonBERT/refs/heads/master/benchmarks/CodonBERT/data/fine-tune/mRFP_Expression.csv"

output_filename = "mRFP_Expression.csv"
wget_command = ["wget", "-O", output_filename, url]

try:
    subprocess.run(wget_command, check=True)
    print(f"File downloaded successfully as {output_filename}")
except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")
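
If `wget` is not installed, the same file can be fetched with the Python standard library. This is a hedged alternative, not what the notebook ran: `download_file` is a hypothetical helper built on `urllib.request.urlretrieve`.

```python
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Download `url` to `destination` using only the standard library and return the path."""
    destination = Path(destination)
    urllib.request.urlretrieve(url, destination)
    return destination


# Usage with the variables defined in the cell above:
# download_file(url, output_filename)
```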

In [3]:
dataset = pd.read_csv(output_filename)
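# Copy each split so the column assignments below modify independent DataFrames
# rather than views of `dataset` (this avoids pandas' SettingWithCopyWarning)
train_data = dataset[dataset["Split"] == "train"].copy()
eval_data = dataset[dataset["Split"] == "val"].copy()
test_data = dataset[dataset["Split"] == "test"].copy()

# The dataset stores mRNA sequences; convert U to T to obtain the DNA alphabet
train_data["Sequence"] = train_data["Sequence"].str.replace("U", "T", regex=False)
eval_data["Sequence"] = eval_data["Sequence"].str.replace("U", "T", regex=False)
test_data["Sequence"] = test_data["Sequence"].str.replace("U", "T", regex=False)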
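
After the U-to-T conversion it can be worth confirming that every sequence uses only the DNA alphabet. The sketch below is illustrative: `rna_to_dna` and `invalid_characters` are hypothetical helpers, and the example sequence is made up rather than taken from the dataset.

```python
def rna_to_dna(sequence):
    """Replace every uracil (U) with thymine (T), leaving other characters unchanged."""
    return sequence.replace("U", "T")


def invalid_characters(sequence, alphabet="ACGT"):
    """Return the set of characters in `sequence` that fall outside `alphabet`."""
    return set(sequence) - set(alphabet)


example = "AUGGCCUCCUAA"  # illustrative sequence, not from the dataset
converted = rna_to_dna(example)
print(converted)
print(invalid_characters(converted))
```

Applied to a DataFrame column, `train_data["Sequence"].map(invalid_characters)` would give one set per row, with empty sets indicating clean sequences.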

**Defining our Evo 2 7B model**

In [4]:
evo2 = Evo2(Evo2Config(model_name="evo2-7b-base", batch_size=1))

100%|██████████| 32/32 [00:00<00:00, 67.79it/s]


Extra keys in state_dict: {'blocks.17.mixer.dense._extra_state', 'unembed.weight', 'blocks.31.mixer.dense._extra_state', 'blocks.24.mixer.dense._extra_state', 'blocks.10.mixer.dense._extra_state', 'blocks.3.mixer.dense._extra_state'}


**Inspect the model and prepare our data for it**

In [5]:
evo2.model

StripedHyena(
  (embedding_layer): VocabParallelEmbedding(512, 4096)
  (blocks): ModuleList(
    (0): ParallelGatedConvBlock(
      (pre_norm): RMSNorm()
      (post_norm): RMSNorm()
      (filter): HyenaCascade()
      (projections): TELinear()
      (out_filter_dense): Linear(in_features=4096, out_features=4096, bias=True)
      (mlp): ParallelGatedMLP(
        (l1): Linear(in_features=4096, out_features=11008, bias=False)
        (l2): Linear(in_features=4096, out_features=11008, bias=False)
        (l3): Linear(in_features=11008, out_features=4096, bias=False)
      )
    )
    (1-2): 2 x ParallelGatedConvBlock(
      (pre_norm): RMSNorm()
      (post_norm): RMSNorm()
      (filter): HyenaCascade()
      (projections): TELinear()
      (out_filter_dense): Linear(in_features=4096, out_features=4096, bias=True)
      (mlp): ParallelGatedMLP(
        (act): Identity()
        (l1): Linear(in_features=4096, out_features=11008, bias=False)
        (l2): Linear(in_features=4096, out_feat

In [6]:
train_dataset = evo2.process_data(train_data)
eval_dataset = evo2.process_data(eval_data)
test_dataset = evo2.process_data(test_data)

**Generate embeddings for each sequence using Evo 2**

In [7]:
train_embeddings = evo2.get_embeddings(train_dataset)
eval_embeddings = evo2.get_embeddings(eval_dataset)
test_embeddings = evo2.get_embeddings(test_dataset)

Getting embeddings: 100%|██████████| 1021/1021 [04:07<00:00,  4.13it/s]
Getting embeddings: 100%|██████████| 219/219 [00:52<00:00,  4.13it/s]
Getting embeddings: 100%|██████████| 219/219 [00:52<00:00,  4.18it/s]


In [9]:
np.savez("train_embeddings.npz", embeddings=np.array(train_embeddings["embeddings"]), original_lengths=np.array(train_embeddings["original_lengths"]))
np.savez("eval_embeddings.npz", embeddings=np.array(eval_embeddings["embeddings"]), original_lengths=np.array(eval_embeddings["original_lengths"]))
np.savez("test_embeddings.npz", embeddings=np.array(test_embeddings["embeddings"]), original_lengths=np.array(test_embeddings["original_lengths"]))
print("Embeddings saved successfully")

Embeddings saved successfully


**Load the saved embeddings so we don't have to regenerate them every time**

In [52]:
train_embeddings = np.load("train_embeddings.npz")
eval_embeddings = np.load("eval_embeddings.npz")
test_embeddings = np.load("test_embeddings.npz")
train_embeddings["embeddings"].shape, eval_embeddings["embeddings"].shape, test_embeddings["embeddings"].shape

((1021, 678, 4096), (219, 678, 4096), (219, 678, 4096))
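
Each array holds one embedding per sequence position, so it has to be reduced to a single vector per sequence before it can feed a regression head. The sketch below shows two common reductions in NumPy. It assumes that sequences are right-padded and that `original_lengths` counts the real positions; neither is confirmed by this notebook, and the function names are hypothetical.

```python
import numpy as np


def last_token_pool(embeddings, lengths):
    """Pick the embedding at index length - 1 for every sequence (assumes right padding)."""
    idx = np.asarray(lengths) - 1
    return embeddings[np.arange(len(embeddings)), idx]


def mean_pool(embeddings, lengths):
    """Average over the first `length` positions of every sequence (assumes right padding)."""
    lengths = np.asarray(lengths)
    positions = np.arange(embeddings.shape[1])
    mask = (positions[None, :] < lengths[:, None]).astype(embeddings.dtype)  # (N, L)
    # einsum sums the unmasked positions without building an (N, L, D) temporary
    summed = np.einsum("nld,nl->nd", embeddings, mask)
    return summed / lengths[:, None]


# Toy input: 2 sequences, padded length 3, embedding size 2
toy = np.arange(2 * 3 * 2, dtype=np.float32).reshape(2, 3, 2)
toy_lengths = np.array([3, 2])
print(last_token_pool(toy, toy_lengths))
print(mean_pool(toy, toy_lengths))
```

For a causal model the last real position has seen the whole sequence, which is one reason to prefer last-token pooling; mean pooling is a common alternative worth comparing.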

**We define a probing head: two stacked linear layers with no activation in between**

In [83]:
import torch.nn.init as init

# Define the model
import torch.nn as nn

head_model = nn.Sequential(
    nn.Linear(4096, 512),
    nn.Linear(512, 1)
)


# Initialize weights using Xavier initialization
def init_weights(m):
    if isinstance(m, torch.nn.Linear):
        init.xavier_uniform_(m.weight)
        if m.bias is not None:
            init.zeros_(m.bias)

head_model.apply(init_weights)

head_model.train()


Sequential(
  (0): Linear(in_features=4096, out_features=512, bias=True)
  (1): Linear(in_features=512, out_features=1, bias=True)
)
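
Because there is no activation between the two `nn.Linear` layers, the head can only represent a single affine map. The NumPy sketch below illustrates this with small, made-up sizes (not the 4096 and 512 used above) by collapsing two layers into one and comparing the outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 4  # small illustrative sizes

# Weights and biases of two stacked linear layers
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(1, d_hidden)), rng.normal(size=1)
x = rng.normal(size=d_in)

# Output of the two-layer head
two_layers = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```

Inserting a non-linearity such as `nn.ReLU()` between the layers would make the head a non-linear MLP. Whether that helps on this task is not something this notebook tests.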

In [95]:
from torch.utils.data import DataLoader, TensorDataset
from copy import deepcopy
import torch

def train_model(model: torch.nn.Sequential,
                X_train: torch.Tensor,  
                y_train: torch.Tensor,  
                X_val: torch.Tensor, 
                y_val: torch.Tensor, 
                optimizer = torch.optim.Adam, 
                loss_fn = torch.nn.functional.mse_loss,
                num_epochs = 50,
                batch = 64,
                lr_scheduler_step=30,
                lr_scheduler_gamma=0.1):    

    # Create DataLoader for batching
    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=batch, shuffle=True)

    # Validation dataset
    val_dataset = TensorDataset(X_val, y_val)
    val_loader = DataLoader(val_dataset, batch_size=batch, shuffle=False)

    optimizer = optimizer(model.parameters(), lr=0.0001)  # Set an initial learning rate

    # Learning rate scheduler
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_scheduler_step, gamma=lr_scheduler_gamma)

    # Ensure model is in training mode
    model.train()

    for epoch in range(num_epochs):
        average_train_loss = 0
        for batch_X, batch_y in train_loader:
            # Forward pass
            outputs = model(batch_X)
            
            # Compute loss
            loss = loss_fn(outputs, batch_y)
            average_train_loss += loss.item()
            
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        
        print(f"Epoch {epoch+1}, Training Loss: {average_train_loss/len(train_loader)}")
        # Step the scheduler
        scheduler.step()
        
        # Validation phase (optional)
        model.eval()
        with torch.no_grad():
            val_losses = []
            for val_X, val_y in val_loader:
                val_outputs = model(val_X)
                val_loss = loss_fn(val_outputs, val_y)
                val_losses.append(val_loss.item())
            
            print(f"Epoch {epoch+1}, Validation Loss: {sum(val_losses)/len(val_losses)}, Learning Rate: {scheduler.get_last_lr()[0]}")
        
        # Set back to training mode for next epoch
        model.train()
        
    model.eval()   
    return model
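
The `StepLR` scheduler used above multiplies the learning rate by `gamma` once every `step_size` calls to `scheduler.step()`. As a rough sketch of what that means for the defaults in `train_model`, the hypothetical helper below computes the rate used for training in a given 0-indexed epoch.

```python
def step_lr(initial_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate in effect while training `epoch` (0-indexed) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# With lr=1e-4, step_size=30 and gamma=0.1, the rate drops once, at epoch 30
for epoch in (0, 29, 30, 49):
    print(epoch, step_lr(1e-4, epoch))
```

Since `train_model` calls `scheduler.step()` before printing the validation line, the learning rate shown there is the one that the next epoch will train with.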


In [96]:
# Represent each sequence by the embedding at its last position
X_train = np.ascontiguousarray(train_embeddings["embeddings"][:, -1, :])
X_val = np.ascontiguousarray(eval_embeddings["embeddings"][:, -1, :])

y_train, y_eval = train_data["Value"].to_numpy(), eval_data["Value"].to_numpy()

head_model_evo_2 = deepcopy(head_model)
head_model_evo_2 = train_model(model=head_model_evo_2, 
                               X_train=torch.from_numpy(X_train), 
                               y_train=torch.tensor(y_train, dtype=torch.float32).unsqueeze(1), 
                               X_val=torch.from_numpy(X_val), 
                               y_val=torch.tensor(y_eval, dtype=torch.float32).unsqueeze(1))

Epoch 1, Training Loss: 102.69164276123047
Epoch 1, Validation Loss: 102.46187591552734, Learning Rate: 0.0001
Epoch 2, Training Loss: 101.68602132797241
Epoch 2, Validation Loss: 101.42153358459473, Learning Rate: 0.0001
Epoch 3, Training Loss: 100.61723613739014
Epoch 3, Validation Loss: 100.30001640319824, Learning Rate: 0.0001
Epoch 4, Training Loss: 99.44958639144897
Epoch 4, Validation Loss: 99.08520698547363, Learning Rate: 0.0001
Epoch 5, Training Loss: 98.18503952026367
Epoch 5, Validation Loss: 97.77401924133301, Learning Rate: 0.0001
Epoch 6, Training Loss: 96.83944082260132
Epoch 6, Validation Loss: 96.36771392822266, Learning Rate: 0.0001
Epoch 7, Training Loss: 95.3786392211914
Epoch 7, Validation Loss: 94.86300849914551, Learning Rate: 0.0001
Epoch 8, Training Loss: 93.83050775527954
Epoch 8, Validation Loss: 93.2645092010498, Learning Rate: 0.0001
Epoch 9, Training Loss: 92.19836378097534
Epoch 9, Validation Loss: 91.57369422912598, Learning Rate: 0.0001
Epoch 10, Train

In [97]:
# Represent each test sequence by the embedding at its last position
X_test = np.ascontiguousarray(test_embeddings["embeddings"][:, -1, :])
y_test = test_data["Value"].to_numpy()
predictions_nn = head_model_evo_2(torch.from_numpy(X_test))
y_pred = predictions_nn.detach().cpu().numpy().squeeze()
y_true = np.array(y_test)
y_pred.shape, y_true.shape

((219,), (219,))

In [98]:
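loss = torch.nn.MSELoss()
# predictions_nn has shape (219, 1) while the targets have shape (219,); squeeze the
# predictions so the loss is computed element-wise instead of broadcasting to (219, 219)
loss_test = loss(predictions_nn.squeeze(1), torch.tensor(y_test, dtype=torch.float32))
loss_test.item()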

39.27373123168945
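
Shape alignment matters when computing a mean squared error: a column of predictions with shape `(N, 1)` and a flat target vector with shape `(N,)` broadcast to an `(N, N)` matrix of differences, so every prediction is compared against every target. The toy NumPy sketch below, with made-up values, shows the effect; PyTorch follows the same broadcasting rules.

```python
import numpy as np

pred = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like the head's output
target = np.array([1.0, 2.0, 3.0])      # shape (3,), like the raw labels

broadcast_diff = pred - target           # broadcasts to shape (3, 3)
aligned_diff = pred.squeeze(1) - target  # shape (3,), element-wise

mse_broadcast = float((broadcast_diff ** 2).mean())
mse_aligned = float((aligned_diff ** 2).mean())

print(broadcast_diff.shape, mse_broadcast)
print(aligned_diff.shape, mse_aligned)
```

Here the predictions match the targets exactly, yet the broadcast version still reports a non-zero error.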

In [80]:
import evaluate

spearmanr_metric = evaluate.load("spearmanr")
pearson_metric = evaluate.load("pearsonr")
results_spearman = spearmanr_metric.compute(references=y_true, predictions=y_pred)
result_pearson = pearson_metric.compute(references=y_true, predictions=y_pred)
results_spearman, result_pearson

({'spearmanr': 0.017463609316120843}, {'pearsonr': 0.00015963052435425887})
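
As a cross-check that does not depend on the `evaluate` package, both correlations can be computed directly with NumPy. This is a minimal sketch with hypothetical helper names: Pearson comes from `np.corrcoef`, and Spearman is the Pearson correlation of average ranks.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy values for illustration only
toy_true = [1, 2, 3, 4, 5]
toy_pred = [2, 1, 4, 3, 5]
print(spearman(toy_true, toy_pred), pearson(toy_true, toy_pred))
```

Calling `spearman(y_true, y_pred)` and `pearson(y_true, y_pred)` on the arrays from the cells above should agree with the `evaluate` metrics up to floating-point error.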