## Training Demo Enformer Celltyping

This workbook steps through training Enformer Celltyping using the small subset of data available in the repo.

## Step 1: Create the pre-trained Enformer model

The Enformer model available from TensorFlow Hub first needs to be downloaded. Its weights
are then lifted and loaded into a recreated Enformer model from which the final layers
after the attention layers have been removed.
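
As a framework-free illustration of the "lift the weights into a truncated copy" idea described above, the sketch below represents a model as a mapping from layer name to weight array. Every name in it (`lift_weights`, the layer names, the shapes) is hypothetical. In this workbook the real work is done by the helpers imported from `EnformerCelltyping.utils`, whose internals are not shown here.

```python
import numpy as np

def lift_weights(source, target):
    """Copy weights from `source` into `target` for every layer `target` has.

    Both arguments are dicts mapping layer name -> weight array (a toy
    stand-in for a real model). Layers that exist only in `source`, such
    as the output heads, are ignored.
    """
    lifted = {}
    for name, placeholder in target.items():
        if name not in source:
            raise KeyError(f"layer {name!r} is missing from the source model")
        if source[name].shape != placeholder.shape:
            raise ValueError(f"shape mismatch for layer {name!r}")
        lifted[name] = source[name].copy()
    return lifted

rng = np.random.default_rng(0)

# Toy "pre-trained" model: trunk layers followed by an output head.
pretrained = {
    "conv_stem": rng.normal(size=(4, 8)),
    "attention_0": rng.normal(size=(8, 8)),
    "output_head": rng.normal(size=(8, 3)),
}

# Recreated model that stops after the attention layer (zero-initialised).
chopped = {
    "conv_stem": np.zeros((4, 8)),
    "attention_0": np.zeros((8, 8)),
}

chopped = lift_weights(pretrained, chopped)
print(sorted(chopped))  # the output head is not part of the chopped model
```

Matching layers by name and checking shapes before copying makes a mismatch between the downloaded model and the recreated one fail loudly instead of silently loading the wrong weights.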

In [1]:
#Enformer imports
import tensorflow as tf
from EnformerCelltyping.utils import gelu, create_enf_model, pearsonR



2023-02-02 14:49:38.087627: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


In [2]:
from EnformerCelltyping.constants import DATA_PATH
from EnformerCelltyping.utils import create_enf_chopped_model
enf = create_enf_chopped_model(str(DATA_PATH / "enformer_model"))

OSError: SavedModel file does not exist at: /home/aemurphy/RDS/project/celltypeai/live/Projects/EnformerCelltyping/data/enformer_model/{saved_model.pbtxt|saved_model.pb}
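
The `OSError` above reports that the directory passed to `create_enf_chopped_model` contains neither `saved_model.pb` nor `saved_model.pbtxt`, which most likely means the Enformer model has not yet been saved to that location. A small pre-flight check can surface this before TensorFlow is asked to load anything. The helper name `has_saved_model` is hypothetical and not part of the repo; the temporary directory only stands in for the real model folder.

```python
import tempfile
from pathlib import Path

def has_saved_model(model_dir):
    """Return True if `model_dir` holds a SavedModel protobuf file."""
    model_dir = Path(model_dir)
    return any((model_dir / name).is_file()
               for name in ("saved_model.pb", "saved_model.pbtxt"))

# Demonstration on a throwaway directory standing in for the model folder.
with tempfile.TemporaryDirectory() as tmp:
    before = has_saved_model(tmp)            # nothing saved yet
    (Path(tmp) / "saved_model.pb").touch()   # simulate a downloaded model
    after = has_saved_model(tmp)

print(before, after)
```

In the workbook, the same check could be applied to `DATA_PATH / "enformer_model"` before calling `create_enf_chopped_model`.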

In [2]:
#create Enformer Celltyping - for training
from EnformerCelltyping.enf_celltyping import Enformer_Celltyping
from EnformerCelltyping.utils import pearsonR

assays = ['h3k27ac', 'h3k4me1', 'h3k4me3', 'h3k9me3', 'h3k27me3', 'h3k36me3']
learning_rate = 0.0002

#Using all the model default parameters for the architecture
#Set use_prebuilt_model to False since we want to train 
#Enformer Celltyping from scratch
model = Enformer_Celltyping(assays=assays,
                            use_prebuilt_model=False)
#Compile the model. The model is separated into 2 channels:
# 1. DNA channel, which predicts an average histone mark score across
#    all training cell types
# 2. Chromatin accessibility channel, which predicts the delta between
#    the average histone mark value for that region and the
#    cell type-specific one.
# Thus there are 2 loss functions, one per output
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
              loss={'avg': tf.keras.losses.poisson,
                    'delta': tf.keras.losses.mean_squared_error},
              metrics=['mse', pearsonR])

#Let's view the model
model.summary()

2023-02-02 14:49:40.707070: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2023-02-02 14:49:40.731594: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-02-02 14:49:40.731613: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (workstation-neurogenomics): /proc/driver/nvidia/version does not exist
2023-02-02 14:49:40.732144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "EnfCelltyping"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
ChromAccessLclInput (InputLayer [(None, 1562)]       0                                            
__________________________________________________________________________________________________
dnaInput (InputLayer)           [(None, 896, 1536)]  0                                            
__________________________________________________________________________________________________
tf.clip_by_value (TFOpLambda)   (None, 1562)         0           ChromAccessLclInput[0][0]        
__________________________________________________________________________________________________
dense1_dna (Dense)              (None, 896, 1536)    2360832     dnaInput[0][0]                   
______________________________________________________________________________________
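
To make the two-output objective from the `model.compile` call above concrete, here is a NumPy sketch of the two losses. It assumes the Keras defaults: when a dict of losses is given and no `loss_weights` are set, each output's loss is averaged over the batch and the results are summed with weight 1. The toy arrays and the helper names `poisson_loss` and `mse_loss` are made up for illustration.

```python
import numpy as np

EPS = 1e-7  # matches the default Keras backend epsilon

def poisson_loss(y_true, y_pred):
    # Same formula as tf.keras.losses.poisson: mean(y_pred - y_true * log(y_pred + eps))
    return np.mean(y_pred - y_true * np.log(y_pred + EPS), axis=-1)

def mse_loss(y_true, y_pred):
    # Same formula as tf.keras.losses.mean_squared_error
    return np.mean((y_true - y_pred) ** 2, axis=-1)

# Toy batch of one region with two positions per output.
avg_true = np.array([[1.0, 2.0]])     # average signal across cell types
avg_pred = np.array([[1.0, 2.0]])
delta_true = np.array([[0.5, -0.5]])  # cell type-specific offset from the average
delta_pred = np.array([[0.0, 0.0]])

avg_loss = poisson_loss(avg_true, avg_pred).mean()
delta_loss = mse_loss(delta_true, delta_pred).mean()
total_loss = avg_loss + delta_loss    # default: unweighted sum over outputs

print(avg_loss, delta_loss, total_loss)
```

A Poisson loss suits the non-negative `avg` output, while the `delta` output can be negative, which is why it is paired with mean squared error instead.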

This model summary does not include the Enformer layers, because the DNA for the demo regions
has already been passed through the pre-trained Enformer model. Since those layers are frozen,
their weights will not update, so pre-running them saves compute time and RAM. If
you want to build a version of Enformer Celltyping with the Enformer layers, run:

```

```
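
The caching idea described above (run the frozen layers once, then train only the layers after them) can be illustrated independently of the repo's API with a toy NumPy sketch. The "frozen trunk" here is just a fixed random projection, and all names, shapes, and hyperparameters are assumptions chosen for the example; none of this is Enformer Celltyping code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen trunk: a fixed projection followed by ReLU.
W_frozen = rng.normal(size=(16, 8))
W_frozen_before = W_frozen.copy()

def frozen_trunk(x):
    return np.maximum(x @ W_frozen, 0.0)

x = rng.normal(size=(32, 16))  # toy inputs
y = rng.normal(size=(32, 1))   # toy targets

# Run the frozen trunk once and cache its output.
features = frozen_trunk(x)

# Train only the head on the cached features with plain gradient descent.
w_head = np.zeros((8, 1))
learning_rate = 1e-3
losses = []
for _ in range(200):
    err = features @ w_head - y
    losses.append(float(np.mean(err ** 2)))
    grad = 2.0 * features.T @ err / len(x)
    w_head -= learning_rate * grad

print(losses[0], losses[-1])
```

Because the trunk never changes, its output for a given input is identical on every epoch, so each training step only pays for the head.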

In [3]:
import numpy as np
a = np.load("/home/aemurphy/data/model_ref/avg_atac_128.npz")
a['h3k27ac']

KeyError: 'h3k27ac is not a file in the archive'
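
The `KeyError` above says the archive has no array stored under `'h3k27ac'`. The object returned by `np.load` for an `.npz` file exposes a `.files` attribute listing the stored names, so printing it first shows which keys are valid. The sketch below uses a throwaway archive with made-up keys (`atac`, `other`); the actual contents of `avg_atac_128.npz` are not shown in this workbook.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Build a small stand-in archive; the key names are hypothetical.
    path = os.path.join(tmp, "toy_archive.npz")
    np.savez(path, atac=np.arange(4), other=np.ones(2))

    with np.load(path) as archive:
        keys = list(archive.files)              # names stored in the archive
        has_mark = "h3k27ac" in archive.files   # check before indexing
        atac = archive["atac"]                  # safe: key is known to exist

print(keys, has_mark)
```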