# Inference with Isoformer

This notebook demonstrates how to use the Isoformer model to predict gene expression from multi-omics data. It covers loading the model, tokenizing DNA, RNA, and protein sequences, and running inference to obtain gene expression predictions.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/instadeepai/nucleotide-transformer/blob/main/notebooks/isoformer/inference.ipynb)

## Installation and Dependencies

First, install the packages required to run the Isoformer model.

In [None]:
! pip install -U huggingface_hub
! pip install -U datasets
! pip install transformers
! pip install torch
! pip install numpy
! pip install enformer_pytorch
! pip install tqdm
! pip install pyfaidx
! pip install pandas
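
Before moving on, it can help to confirm that the packages imported later in this notebook are actually available. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of any of the libraries installed above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names in `names` that cannot be located as importable modules."""
    return [name for name in names if find_spec(name) is None]


# Top-level modules imported later in this notebook.
required = ["datasets", "transformers", "numpy", "torch"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

If anything is reported as missing, re-running the installation cell (and restarting the runtime on Colab) is usually the first thing to try.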

## Import Required Libraries

Import the libraries needed for data processing and model inference.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM
import numpy as np
import torch

## Load Dataset

Load the multi-omics transcript expression dataset. We use the test split with a sequence length of 196,608 nucleotides.

In [None]:
# Load the dataset
transcript_expression_dataset = load_dataset(
    "InstaDeepAI/multi_omics_transcript_expression",
    task_name="transcript_expression_expression",
    sequence_length=196608,
    filter_out_sequence_length=196608,
    split="test",
    streaming=False,
    light_version=True, # Set to False to use the full dataset
)
dataset = iter(transcript_expression_dataset)
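
Because `dataset` is a plain Python iterator, every call to `next` consumes one sample. As a sketch of how one might take a fixed number of samples while leaving the rest of the iterator intact, the example below applies `itertools.islice` to stand-in records. The `records` list and its placeholder strings are hypothetical; they only mimic the `DNA`, `RNA`, and `Protein` fields used later in this notebook.

```python
from itertools import islice

# Hypothetical stand-in records; real samples come from the dataset iterator.
records = [
    {"DNA": "ACGT", "RNA": "ACGT", "Protein": "MK"},
    {"DNA": "TTGA", "RNA": "TTGA", "Protein": "MA"},
    {"DNA": "GGCC", "RNA": "GGCC", "Protein": "ML"},
]
stream = iter(records)

# Take the first two samples; the iterator keeps its position afterwards.
first_two = list(islice(stream, 2))
remaining = list(stream)

print(f"Taken: {len(first_two)}, left in the iterator: {len(remaining)}")
```

The same pattern applies to the real iterator, for example `list(islice(dataset, 4))` to collect four samples.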

## Load Model and Tokenizer

Load the pre-trained Isoformer model and its tokenizer from Hugging Face.

In [None]:
# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)

## Prepare Input Data

Take one sample from the dataset and extract its DNA, RNA, and protein sequences.

In [None]:
# Sample data
sample_data = next(dataset)
protein_sequences = [sample_data["Protein"]]
rna_sequences = [sample_data["RNA"]]
dna_sequences = [sample_data["DNA"]]
sequence_length = 196_608
rng = np.random.default_rng(seed=0)
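
The DNA input is expected to span 196,608 nucleotides, so a quick sanity check can catch truncated or malformed sequences before tokenization. The sketch below rests on assumptions: the helper `check_dna_sequence` and the allowed alphabet `ACGTN` are hypothetical choices made for illustration, not requirements documented for the Isoformer tokenizer.

```python
def check_dna_sequence(sequence, expected_length, alphabet="ACGTN"):
    """Return a list of human-readable problems found in `sequence` (empty if none)."""
    problems = []
    if len(sequence) != expected_length:
        problems.append(f"expected length {expected_length}, got {len(sequence)}")
    # Compare case-insensitively against the assumed alphabet.
    unexpected = sorted(set(sequence.upper()) - set(alphabet))
    if unexpected:
        problems.append(f"unexpected characters: {''.join(unexpected)}")
    return problems


# Toy examples with a short window; in this notebook expected_length would be 196_608.
print(check_dna_sequence("ACGTNACG", expected_length=8))
print(check_dna_sequence("ACGX", expected_length=8))
```

With the real sample, the call would look like `check_dna_sequence(dna_sequences[0], sequence_length)`.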

## Tokenize Input Sequences

Tokenize the input sequences for the model.

In [None]:
# Tokenize all three modalities; the result is indexed below as DNA (0), RNA (1), protein (2)
torch_tokens = tokenizer(
    dna_input=dna_sequences, rna_input=rna_sequences, protein_input=protein_sequences
)
dna_torch_tokens = torch.tensor(torch_tokens[0]["input_ids"])
rna_torch_tokens = torch.tensor(torch_tokens[1]["input_ids"])
protein_torch_tokens = torch.tensor(torch_tokens[2]["input_ids"])

## Run Model Inference

Run inference with the Isoformer model to predict gene expression levels and obtain DNA embeddings.

In [None]:
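# Inference (gradients are not needed, so disable them to save memory)
with torch.no_grad():
    torch_output = model(
        tensor_dna=dna_torch_tokens,
        tensor_rna=rna_torch_tokens,
        tensor_protein=protein_torch_tokens,
        # Mask out positions whose token id is 1 (assumed here to be the padding token)
        attention_mask_rna=rna_torch_tokens != 1,
        attention_mask_protein=protein_torch_tokens != 1,
    )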

print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
print(f"Final DNA embeddings: {torch_output['final_dna_embeddings']}")
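
The RNA and protein attention masks in the inference call are built by comparing every token id with 1, which gives `True` for positions to attend to and `False` elsewhere. The dependency-free sketch below reproduces that elementwise comparison on nested lists. `PAD_TOKEN_ID = 1` mirrors the literal used in the inference call and is assumed, not confirmed, to be the tokenizer's padding id; the token ids themselves are made up.

```python
PAD_TOKEN_ID = 1  # Assumption: mirrors the literal `1` used in the inference call.


def padding_mask(batch_token_ids, pad_token_id=PAD_TOKEN_ID):
    """Return a boolean mask that is False wherever a token equals the padding id."""
    return [[token != pad_token_id for token in row] for row in batch_token_ids]


# Made-up token ids: two sequences padded to the same length.
tokens = [
    [3, 7, 9, 1, 1],
    [3, 5, 1, 1, 1],
]
mask = padding_mask(tokens)
for row in mask:
    print(row)
```

On tensors, the expression `rna_torch_tokens != 1` performs the same comparison in a single vectorized step.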