<a href="https://colab.research.google.com/github/pinellolab/DNA-Diffusion/blob/enformer-implementation/dna-diffusion/metrics/gene_expression/enformer/enformer_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enformer inference notebook
This notebook preprocesses sequence data and runs Enformer inference. Given a gene as input, the code fetches the gene's sequence and extends the window to roughly 200 kb around the transcription start site (TSS), because Enformer only accepts inputs of about 200 kb. We then substitute our generated regulatory sequence for one of the regulatory elements of the gene under consideration and run inference. For now we do not use newly generated sequences, because the DNA diffusion integration is not yet complete. Instead, we use the ABC data, a dataset of regulatory sequences from the human genome.
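The windowing and substitution steps described above can be sketched in plain Python. This is only an illustration, not the notebook's actual preprocessing: the function names `tss_window` and `splice` are hypothetical, coordinates are assumed to be 0-based and half-open, and the input length of 196,608 bp is an assumption taken from the enformer-pytorch usage examples (the "200 kb" above is a rounded figure).

```python
from typing import Tuple

# Assumption: input length used in the enformer-pytorch examples (~200 kb).
ENFORMER_INPUT_LEN = 196_608


def tss_window(tss: int, length: int = ENFORMER_INPUT_LEN) -> Tuple[int, int]:
    """Return a half-open [start, end) window of `length` bp centred on the TSS.

    Chromosome boundaries are not handled here; a TSS close to the start of a
    chromosome would give a negative start coordinate.
    """
    start = tss - length // 2
    return start, start + length


def splice(window_seq: str, window_start: int,
           elem_start: int, elem_end: int, generated: str) -> str:
    """Replace the element at genomic [elem_start, elem_end) with `generated`."""
    # Convert genomic coordinates to positions relative to the window.
    rel_start = elem_start - window_start
    rel_end = elem_end - window_start
    if not (0 <= rel_start <= rel_end <= len(window_seq)):
        raise ValueError("regulatory element lies outside the window")
    if len(generated) != rel_end - rel_start:
        # Keeping the length fixed keeps the window at the size the model expects.
        raise ValueError("generated sequence must match the element length")
    return window_seq[:rel_start] + generated + window_seq[rel_end:]
```

Requiring the generated sequence to have the same length as the element it replaces is a design choice of this sketch: it keeps the total window length unchanged, so no re-padding or trimming is needed before inference.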


#### TODO
*   Change the code to process and run inference on only one gene at a time (to prevent memory errors).
*   Import Supplementary Data Table 2 from the Enformer paper to get the cell types and genomic track types.
*   Perform a sanity check on the DNA diff test data; see: https://discord.com/channels/850068776544108564/1024646567833112656/1055581251483996210



In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')


Mounted at /content/gdrive/


Install all dependencies. Once they are installed, set `setup` to `False` to skip the time-consuming install checks when rerunning the entire notebook.
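As an alternative to toggling the flag by hand, the value of `setup` could be derived from what is already installed. The following is a minimal sketch using only the standard library; the helper `missing_modules` and the list `REQUIRED_MODULES` are hypothetical names introduced here, and the list uses import names rather than pip names (for example, biopython is imported as `Bio`).

```python
import importlib.util
import shutil

# Import names (not pip names) of the packages installed in the next cell.
REQUIRED_MODULES = [
    "transformers", "einops", "polars", "pyfaidx",
    "mygene", "pybedtools", "Bio", "enformer_pytorch",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Run the installs only if a module or the bedtools binary is missing.
setup = bool(missing_modules(REQUIRED_MODULES)) or shutil.which("bedtools") is None
```

This only checks that each package can be located, not that it is a compatible version.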

In [None]:
setup = True

if setup:
  %pip install transformers
  %pip install einops 
  %pip install polars
  %pip install pyfaidx
  %pip install mygene
  !apt-get install bedtools
  %pip install pybedtools
  %pip install biopython
  %pip install enformer-pytorch


In [None]:
import os
import pandas as pd
import torch

# Paths on the mounted Google Drive; adjust them to your own Drive layout.
ROOT_DIR = '/content/gdrive/MyDrive/'
PROJ_DIR = 'Colab Notebooks/dna-diffusion/metrics/gene_expression/enformer/'
SUB_DIR = 'enformer_lucidrains_pytorch/'

# Change into the project directory so the local modules below can be imported.
os.chdir(os.path.join(ROOT_DIR, PROJ_DIR))
from enformer_lucidrains_pytorch.enformer_pytorch import Enformer
from dataloader import EnformerDataLoader
from utils import inference

In [None]:
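class EnformerInference:
    """Load a pretrained Enformer model together with the input data."""

    def __init__(self, data_path: str, model_path: str = "EleutherAI/enformer-official-rough"):
        # Prefer the GPU when one is available; otherwise fall back to the CPU.
        if torch.cuda.is_available():
            print("Using GPU")
            device = torch.device("cuda")
        else:
            print("Using CPU")
            device = torch.device("cpu")

        self.device = device
        # eval() disables dropout so that predictions are deterministic.
        self.model = Enformer.from_pretrained(model_path).to(device).eval()
        self.data = EnformerDataLoader(pd.read_csv(data_path, sep="\t"))

    @torch.no_grad()  # inference only: skip autograd bookkeeping to save memory
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x.to(self.device))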

Remember to set `Runtime > Change runtime type > GPU` so that inference runs on the GPU.

In [None]:
data_path = "abc_data/K562.PositivePredictions.txt"
model = EnformerInference(data_path)
# Dictionary mapping "Ensembl ID|Gene Name" to the one-hot encoded sequence (a torch.Tensor).
one_hot_seqs = model.data.fetch_sequence()
inference(one_hot_seqs, model)


Using NVIDIA GPU


OutOfMemoryError: ignored
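
The `OutOfMemoryError` above is consistent with the first TODO item: all genes are fetched and passed to inference together. One possible pattern for processing a single gene at a time is sketched below. It is deliberately independent of PyTorch: `predict_per_gene` and its `predict` argument are hypothetical names, and the toy keys are made up for illustration. In this notebook, `predict` could wrap `model.forward` inside `torch.no_grad()` and move the result to the CPU before returning it.

```python
from typing import Callable, Dict, Iterator, Tuple, TypeVar

Seq = TypeVar("Seq")
Pred = TypeVar("Pred")


def predict_per_gene(
    seqs: Dict[str, Seq],
    predict: Callable[[Seq], Pred],
) -> Iterator[Tuple[str, Pred]]:
    """Lazily yield (gene_key, prediction) pairs, one gene at a time.

    Note: entries are removed from `seqs` as they are processed.
    """
    for key in list(seqs):
        # pop() removes the entry from the dictionary, so the input can be
        # garbage-collected once its prediction has been consumed.
        x = seqs.pop(key)
        yield key, predict(x)


# Toy usage: strings stand in for tensors and len() stands in for the model.
toy = {"ENSG_EXAMPLE_1|GENE_A": "ACGT", "ENSG_EXAMPLE_2|GENE_B": "ACGTAC"}
results = dict(predict_per_gene(toy, len))
```

Because the function is a generator, each prediction can be written to disk or reduced to a summary statistic before the next gene is processed, so that only one input and one output need to be held in memory at any time.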