# Use Helix-mRNA

## Access the Helical GitHub [here](https://github.com/helicalAI)!

**In this notebook we will dive into using our latest mRNA Bio Foundation Model, Helix-mRNA.**

**We will get and plot embeddings for our data.**

**We will fine-tune the model using the Helical package.**

If you are running on a CUDA device compatible with `mamba-ssm` and `causal-conv1d`, install the package below. Otherwise, remove the `[mamba-ssm]` optional dependency.
- If you are running on Colab, remove the `[mamba-ssm]` dependency.

In [None]:
!pip install --upgrade helical[mamba-ssm]

### Imports

In [None]:
from helical.models.helix_mrna import HelixmRNAConfig, HelixmRNA, HelixmRNAFineTuningModel
import subprocess
import torch
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

### Download one of CodonBERT's fine-tuning benchmarks

In [None]:
url = "https://raw.githubusercontent.com/Sanofi-Public/CodonBERT/refs/heads/master/benchmarks/CodonBERT/data/fine-tune/mRFP_Expression.csv"

output_filename = "mRFP_Expression.csv"
wget_command = ["wget", "-O", output_filename, url]

try:
    subprocess.run(wget_command, check=True)
    print(f"File downloaded successfully as {output_filename}")
except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")
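The cell above relies on the `wget` binary, which is not installed on every system. As a minimal sketch of an alternative, the same download can be done with Python's standard library. The helper name `download_file` is hypothetical and not part of the Helical package.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(source_url: str, destination: str) -> Path:
    """Download `source_url` to `destination` and return the local path.

    Hypothetical helper for illustration; it raises an exception
    (for example urllib.error.URLError) if the download fails.
    """
    local_path, _headers = urlretrieve(source_url, destination)
    return Path(local_path)


# Example usage (requires network access), reusing the names defined above:
# download_file(url, output_filename)
```

Unlike the `try`/`except` in the cell above, this sketch lets the exception propagate, so a failed download stops the notebook instead of failing later at `pd.read_csv`.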

### Load the dataset as a pandas dataframe and get the splits
- For this example we take a small subset of each split (10 training, 5 validation, and 5 test sequences). Feel free to run it on the entire dataset!

In [None]:
dataset = pd.read_csv(output_filename)
train_data = dataset[dataset["Split"] == "train"][:10]
eval_data = dataset[dataset["Split"] == "val"][:5]
test_data = dataset[dataset["Split"] == "test"][:5]

### Define our Helix-mRNA model and desired configs

In [None]:
# Set max_length to the longest training sequence plus 10,
# which leaves room for special tokens
helix_mrna_config = HelixmRNAConfig(
    device=device,
    batch_size=1,
    max_length=max(len(s) for s in train_data["Sequence"]) + 10,
)
helix_mrna = HelixmRNA(helix_mrna_config)
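To make the `max_length` choice concrete, here is a small sketch on toy sequences. It assumes one token per nucleotide, and the constant name `SPECIAL_TOKEN_MARGIN` is introduced here for illustration only.

```python
# Toy sequences standing in for train_data["Sequence"]
toy_sequences = ["AUGGCC", "AUGGCCAAGUAA", "AUGUAA"]

# Extra positions reserved for special tokens (the notebook uses 10)
SPECIAL_TOKEN_MARGIN = 10

longest_sequence = max(len(s) for s in toy_sequences)
max_length = longest_sequence + SPECIAL_TOKEN_MARGIN

print(f"longest sequence: {longest_sequence}, max_length: {max_length}")
```

Note that the notebook computes this value from the training split only. If the validation or test sequences could be longer, it may be safer to take the maximum over all splits.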

### Process our training sequences to tokenize them and prepare them for the model

In [None]:
processed_train_data = helix_mrna.process_data(train_data["Sequence"].to_list())

### Generate embeddings for the train data

- We get an embedding for each letter/token in each sequence. In this case that is one embedding for each of the 688 tokens in each of the 10 training sequences selected above, with an embedding dimension of 256.
- Because the model is recurrent, the embedding of the final non-special token, at the second-to-last position, summarizes everything that came before it.

In [None]:
embeddings = helix_mrna.get_embeddings(processed_train_data)
embeddings = embeddings[:, -2, :]
print(embeddings.shape)
print(embeddings[:1])
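The indexing step above can be sketched on random data, together with a 2D projection that could be passed to any plotting library. This is only an illustration: it assumes the embeddings behave like a NumPy array of shape `(sequences, tokens, embedding_dim)`, and the PCA-via-SVD projection is a generic technique, not something provided by the Helical package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the model output: (sequences, tokens, embedding_dim)
token_embeddings = rng.normal(size=(10, 688, 256))

# Keep the second-to-last position: the final non-special token
pooled = token_embeddings[:, -2, :]

# Project to 2D with PCA: center the data, then use the top two
# right singular vectors as principal axes
centered = pooled - pooled.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ components[:2].T

print(pooled.shape, coords_2d.shape)
```

Each row of `coords_2d` is one sequence, so a scatter plot of its two columns gives a quick visual check of how the sequences are arranged in embedding space.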

### Fine-tuning the model on our data
- This is a regression task, so the output is a single continuous value per sequence.

In [None]:
helix_mrna_fine_tuning_model = HelixmRNAFineTuningModel(helix_mrna_config=helix_mrna_config, fine_tuning_head="regression", output_size=1)

### Our training data is already processed since the standard Helix-mRNA model and fine-tuning model take the same input!
- We only need to process our eval and test data.

In [None]:
processed_eval_data = helix_mrna_fine_tuning_model.process_data(eval_data["Sequence"].to_list())
processed_test_data = helix_mrna_fine_tuning_model.process_data(test_data["Sequence"].to_list())

### Run fine-tuning on the model for this small sample of data

In [None]:
helix_mrna_fine_tuning_model.train(
    train_dataset=processed_train_data,
    train_labels=train_data["Value"].to_numpy().reshape(-1, 1),
    validation_dataset=processed_eval_data,
    validation_labels=eval_data["Value"].to_numpy().reshape(-1, 1),
    epochs=5,
    loss_function=torch.nn.MSELoss(),
    trainable_layers=2,
)
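The `reshape(-1, 1)` on the labels matters because the regression head produces one value per sequence, presumably as a column of shape `(n, 1)`. The sketch below uses NumPy as a stand-in (PyTorch follows the same broadcasting rules) with made-up numbers to show what can go wrong when the labels stay one-dimensional.

```python
import numpy as np

# Made-up values for illustration
labels = np.array([10.2, 9.8, 11.1])              # shape (3,), as from to_numpy()
predictions = np.array([[10.0], [10.0], [11.0]])  # shape (3, 1), one output per sequence

# Matching shapes: an element-wise difference, then the mean squared error
column_labels = labels.reshape(-1, 1)             # shape (3, 1)
mse = np.mean((predictions - column_labels) ** 2)

# Mismatched shapes: broadcasting silently compares every prediction
# with every label, giving a (3, 3) matrix instead of 3 differences
broadcast_diff = predictions - labels

print(mse, broadcast_diff.shape)
```

Keeping labels and predictions in the same `(n, 1)` layout avoids this silent all-pairs comparison.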

### Get outputs from our model on the test data

In [None]:
outputs = helix_mrna_fine_tuning_model.get_outputs(processed_test_data)
print(outputs)
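To judge the predictions, they can be compared with the held-out labels. The following is a minimal sketch, assuming `outputs` can be converted to a NumPy array with one prediction per sequence. The helper names `rank` and `regression_report` are hypothetical, and the toy numbers are made up.

```python
import numpy as np


def rank(values: np.ndarray) -> np.ndarray:
    """Return 0-based ranks of `values`; assumes there are no ties."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def regression_report(predictions, targets) -> dict:
    """Compute mean squared error and Spearman rank correlation."""
    predictions = np.asarray(predictions, dtype=float).reshape(-1)
    targets = np.asarray(targets, dtype=float).reshape(-1)
    mse = float(np.mean((predictions - targets) ** 2))
    # Spearman correlation is the Pearson correlation of the ranks
    spearman = float(np.corrcoef(rank(predictions), rank(targets))[0, 1])
    return {"mse": mse, "spearman": spearman}


# Toy stand-ins for `outputs` and test_data["Value"]
toy_predictions = np.array([[9.9], [10.6], [11.2], [10.1], [9.5]])
toy_targets = np.array([10.0, 10.5, 11.0, 10.3, 9.4])

report = regression_report(toy_predictions, toy_targets)
print(report)
```

With the notebook's variables, the call would look like `regression_report(outputs, test_data["Value"].to_numpy())`. With only 5 test sequences and 5 epochs on 10 training sequences, the resulting numbers are a smoke test rather than a meaningful benchmark.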