# Use Helix-mRNA

## Access the Helical GitHub [here](https://github.com/helicalAI)!

**In this notebook we will dive into using our latest mRNA Bio Foundation Model, Helix-mRNA.**

**We will get and plot embeddings for our data.**

**We will fine-tune the model using the Helical package.**

## If running on Colab, run the cell below. Comment it out if running locally.

In [None]:
!pip uninstall torch torchtext torchaudio -y
!pip install torch==2.3.0 torchtext torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip uninstall mamba-ssm causal-conv1d -y
!pip install mamba-ssm==2.2.2 --no-cache-dir

In [None]:
!pip install helical
!pip uninstall causal-conv1d -y

### Imports

In [None]:
from helical import HelixmRNAConfig, HelixmRNA, HelixmRNAFineTuningModel
import subprocess
import torch
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

INFO:datasets:PyTorch version 2.3.0 available.
INFO:datasets:TensorFlow version 2.17.0 available.


### Download one of CodonBERT's fine-tuning benchmarks

In [None]:
url = "https://raw.githubusercontent.com/Sanofi-Public/CodonBERT/refs/heads/master/benchmarks/CodonBERT/data/fine-tune/mRFP_Expression.csv"

output_filename = "mRFP_Expression.csv"
wget_command = ["wget", "-O", output_filename, url]

try:
    subprocess.run(wget_command, check=True)
    print(f"File downloaded successfully as {output_filename}")
except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")
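
If `wget` is not available on your machine, the same download can be done with Python's standard library. The sketch below is one possible alternative and is not part of the Helical package; the helper name `download_file` is hypothetical.

```python
import urllib.request
from pathlib import Path


def download_file(url: str, output_filename: str) -> Path:
    """Download `url` to `output_filename` using only the standard library.

    `download_file` is a hypothetical helper, not a Helical API.
    """
    destination = Path(output_filename)
    # urlretrieve writes the response body to disk and raises an exception
    # (for example urllib.error.URLError) if the request fails.
    urllib.request.urlretrieve(url, destination)
    return destination
```

Calling `download_file(url, output_filename)` with the variables defined above would then take the place of the `subprocess` call.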

### Load the dataset as a pandas dataframe and get the splits
- For this example we take a small subset of each split. Feel free to run it on the entire dataset!

In [None]:
dataset = pd.read_csv(output_filename)
train_data = dataset[dataset["Split"] == "train"][:10]
eval_data = dataset[dataset["Split"] == "val"][:5]
test_data = dataset[dataset["Split"] == "test"][:5]
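
Before choosing subset sizes, it can help to check how many rows each split contains. The sketch below counts rows per split on a tiny made-up CSV that reuses the `Sequence`, `Value` and `Split` column names from the cell above; the rows themselves are invented for illustration and are not taken from the benchmark.

```python
import csv
import io
from collections import Counter

# Toy stand-in for mRFP_Expression.csv; these rows are made up for illustration.
toy_csv = io.StringIO(
    "Sequence,Value,Split\n"
    "AUGGCU,10.1,train\n"
    "AUGGCC,9.7,train\n"
    "AUGGCA,8.9,val\n"
    "AUGGCG,10.4,test\n"
)

# Count how many rows fall into each split.
split_counts = Counter(row["Split"] for row in csv.DictReader(toy_csv))
print(dict(split_counts))  # {'train': 2, 'val': 1, 'test': 1}
```

With the real file loaded in pandas, `dataset["Split"].value_counts()` gives the same information.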

### Define our Helix-mRNA model and desired configs

In [4]:
# We set the max length to the length of the longest training sequence + 10,
# to leave space for special tokens
helix_mrna_config = HelixmRNAConfig(
    device=device,
    batch_size=5,
    max_length=max(len(s) for s in train_data["Sequence"]) + 10,
)
helix_mrna = HelixmRNA(helix_mrna_config)

INFO:helical.models.helix_mrna.model:Helix-mRNA initialized successfully.
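
To make the `max_length` arithmetic concrete, here is a minimal sketch on made-up sequences. It assumes one token per letter, as in the configuration above; the variable names `sequences` and `special_token_margin` are illustrative only.

```python
# Toy sequences standing in for train_data["Sequence"]; the values are made up.
sequences = ["AUGGCUAGC", "AUGGCU", "AUGGCUAGCUAA"]

# Margin reserved for special tokens, matching the "+ 10" used in the config.
special_token_margin = 10

longest = max(len(s) for s in sequences)
max_length = longest + special_token_margin
print(longest, max_length)  # 12 22
```

Since the value is derived from the training split only, it is worth checking that the evaluation and test sequences are not longer than the longest training sequence.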


### Process our training sequences to tokenize them and prepare them for the model

In [5]:
processed_train_data = helix_mrna.process_data(train_data["Sequence"].to_list())

### Generate embeddings for the train data

- We get an embedding for each letter/token in the sequence. In the run shown below, this is 100 sequences of 688 tokens each, with an embedding dimension of 256
- Because the model has a recurrent nature, the embedding of the final non-special token, at the second-to-last position, encapsulates everything that came before it, so we keep only that embedding for each sequence
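
The slicing described above can be sketched on a toy array. This assumes the embeddings come back as a NumPy-style array of shape `(num_sequences, num_tokens, embedding_dim)` and that position `-2` holds the last non-special token of every sequence; the small dimensions here are chosen only for readability.

```python
import numpy as np

# Toy stand-in for the per-token embeddings, with shape
# (num_sequences, num_tokens, embedding_dim). The real embedding_dim is 256.
num_sequences, num_tokens, embedding_dim = 3, 5, 4
token_embeddings = np.arange(
    num_sequences * num_tokens * embedding_dim, dtype=np.float32
).reshape(num_sequences, num_tokens, embedding_dim)

# Keep only the second-to-last token of every sequence.
sequence_embeddings = token_embeddings[:, -2, :]
print(sequence_embeddings.shape)  # (3, 4)
```

The result has one vector per sequence, which is the form used for plotting or for feeding a downstream model.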

In [19]:
embeddings = helix_mrna.get_embeddings(processed_train_data)
embeddings = embeddings[:, -2, :]
print(embeddings.shape)
print(embeddings[:1])

Getting embeddings: 100%|██████████| 20/20 [00:00<00:00, 71.50it/s]

(100, 256)
[[-9.70379915e-04  5.94667019e-03  1.07590854e-02  5.22067677e-03
  -9.54915071e-04 -6.74154516e-03 -2.91207526e-03 -1.49831397e-03
   1.78750437e-02  5.13957115e-03  8.79890576e-04  1.21943112e-02
  -1.92209042e-03  3.27306171e-03 -2.27077748e-03  4.50014602e-04
   7.30314665e-03 -8.66744318e-04 -8.81821662e-03 -7.57190645e-01
   1.89280566e-02 -4.05776373e-04  6.08320069e-03 -1.78794132e-03
  -8.79776548e-04 -8.19147026e-05  9.60175938e-04 -8.30806512e-03
   5.66601008e-03 -5.93393855e-03 -5.19109843e-03  6.86887605e-03
  -7.94085041e-02 -5.38914884e-03 -1.55241350e-02 -2.42359545e-02
   2.57678051e-03 -9.53892432e-03 -7.16619950e-04  1.50164040e-02
  -9.01486576e-01 -4.68801707e-03  3.71015654e-03 -1.07593695e-02
   9.67101427e-04  5.75249782e-03  2.86138593e-03 -6.41007500e-04
  -3.93231586e-03 -5.53809397e-04  1.72096007e-02 -8.10448000e-06
   1.20042302e-02 -7.83413649e-03  2.40328256e-03  1.44813021e-04
   6.37711585e-03 -2.75100190e-02 -9.19151399e-03  2.25025918e-02




### Fine-tuning the model on our data
- This is a regression task, so the output is a single continuous value

In [6]:
helix_mrna_fine_tuning_model = HelixmRNAFineTuningModel(helix_mrna_config=helix_mrna_config, fine_tuning_head="regression", output_size=1)

INFO:helical.models.helix_mrna.model:Helix-mRNA initialized successfully.


### Our training data is already processed since the standard Helix-mRNA model and fine-tuning model take the same input!
- We only need to process our eval and test data

In [7]:
processed_eval_data = helix_mrna_fine_tuning_model.process_data(eval_data["Sequence"].to_list())
processed_test_data = helix_mrna_fine_tuning_model.process_data(test_data["Sequence"].to_list())

### Run fine-tuning on the model for this small sample of data

In [13]:
helix_mrna_fine_tuning_model.train(train_dataset=processed_train_data,
                                   train_labels=train_data["Value"].to_numpy().reshape(-1, 1),
                                   validation_dataset=processed_eval_data,
                                   validation_labels=eval_data["Value"].to_numpy().reshape(-1, 1),
                                   epochs=5,
                                   loss_function=torch.nn.MSELoss(),
                                   trainable_layers=2)

INFO:helical.models.helix_mrna.fine_tuning_model:Unfreezing the last 2 layers of the Helix_mRNA model.
INFO:helical.models.helix_mrna.fine_tuning_model:Starting Fine-Tuning
Fine-Tuning: epoch 1/5: 100%|██████████| 20/20 [00:00<00:00, 49.31it/s, loss=86.9]
Fine-Tuning Validation: 100%|██████████| 4/4 [00:00<00:00, 56.86it/s, val_loss=88.8]
Fine-Tuning: epoch 2/5: 100%|██████████| 20/20 [00:00<00:00, 59.90it/s, loss=80.6]
Fine-Tuning Validation: 100%|██████████| 4/4 [00:00<00:00, 53.01it/s, val_loss=83.3]
Fine-Tuning: epoch 3/5: 100%|██████████| 20/20 [00:00<00:00, 51.32it/s, loss=74.9]
Fine-Tuning Validation: 100%|██████████| 4/4 [00:00<00:00, 167.73it/s, val_loss=76.6]
Fine-Tuning: epoch 4/5: 100%|██████████| 20/20 [00:00<00:00, 49.26it/s, loss=67.3]
Fine-Tuning Validation: 100%|██████████| 4/4 [00:00<00:00, 54.98it/s, val_loss=67.4]
Fine-Tuning: epoch 5/5: 100%|██████████| 20/20 [00:00<00:00, 60.40it/s, loss=59.8]
Fine-Tuning Validation: 100%|██████████| 4/4 [00:00<00:00, 50.23it/s, v

### Get outputs from our model on the test data

In [14]:
outputs = helix_mrna_fine_tuning_model.get_outputs(processed_test_data)
print(outputs)

Generating outputs: 100%|██████████| 2/2 [00:00<00:00, 46.17it/s]

[[2.682281 ]
 [2.4182768]
 [2.4362845]
 [2.6120207]
 [2.6543183]
 [2.6988027]
 [2.671821 ]
 [2.144202 ]
 [2.6866376]
 [2.6734226]]
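
One way to compare predictions like these against the held-out labels is to compute the same mean squared error that was used as the training loss. The sketch below is a plain-Python illustration on made-up numbers; `mean_squared_error` is a hypothetical helper, not a Helical function.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between paired predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = ((p - t) ** 2 for p, t in zip(predictions, targets))
    return sum(squared_errors) / len(predictions)


# Made-up numbers for illustration; they are not results from the model.
toy_predictions = [2.5, 3.0, 4.0]
toy_targets = [3.0, 3.0, 2.0]
print(mean_squared_error(toy_predictions, toy_targets))
```

Assuming `outputs` is a NumPy array of shape `(n, 1)`, as the printed output suggests, it could be passed as `outputs.flatten()` together with `test_data["Value"].to_numpy()`.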



