# Train neural networks with `momics` and `tensorflow`

`momics` provides several useful resources to train neural networks with `tensorflow`. This notebook demonstrates how to train a simple neural network with `momics` and `tensorflow`.

## Connect to the data repository

We will tap into the repository generated in the [previous tutorial](integrating-multiomics.ipynb). 

In [None]:
from momics import momics as mmm

## Connect to the existing repository
repo = mmm.Momics("yeast_CNN_data.momics")

## Check that sequence and some tracks are registered
repo.seq()
repo.tracks()


momics :: INFO :: 2025-03-29 23:27:16,813 :: No cloud config found for momics.Consider populating `~/.momics.ini` file with configuration settings for cloud access.


| | idx | label | path |
|---|---|---|---|
| 0 | 0 | atac | /home/jaseriza/repos/momics/data/S288c_atac.bw |
| 1 | 1 | scc1 | /home/jaseriza/repos/momics/data/S288c_scc1.bw |
| 2 | 2 | mnase | /home/jaseriza/repos/momics/data/S288c_mnase.bw |


## Modify some tracks

We can first pre-process the tracks to normalize them, and save them back to the local repository.

In [None]:
import numpy as np

for track in ["atac", "mnase"]:
    cov = repo.tracks(track)
    # Compute genome-wide 99th percentile
    q99 = np.nanpercentile(np.concatenate(list(cov.values())), 99)
    for chrom in cov.keys():
        arr = cov[chrom]
        # Truncate to genome-wide 99th percentile
        arr = np.minimum(arr, q99)
        # Rescale to [0, 1]
        arr = (arr - np.nanmin(arr)) / (np.nanmax(arr) - np.nanmin(arr))
        # Convert NaNs to 0
        arr = np.nan_to_num(arr, nan=0)
        # Store back
        cov[chrom] = arr
    repo.ingest_track(cov, track + "_rescaled")

repo.tracks()


momics :: INFO :: 2025-03-29 23:27:21,396 :: 1 tracks ingested in 1.4004s.
momics :: INFO :: 2025-03-29 23:27:26,492 :: 1 tracks ingested in 1.6719s.


| | idx | label | path |
|---|---|---|---|
| 0 | 0 | atac | /home/jaseriza/repos/momics/data/S288c_atac.bw |
| 1 | 1 | scc1 | /home/jaseriza/repos/momics/data/S288c_scc1.bw |
| 2 | 2 | mnase | /home/jaseriza/repos/momics/data/S288c_mnase.bw |
| 3 | 3 | atac_rescaled | tmpbyibsue1 |
| 4 | 4 | mnase_rescaled | tmpoqeo1dwd |
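
The normalization loop above can be tried on toy data without a repository. The sketch below mirrors its steps (genome-wide percentile cap, then per-chromosome min-max rescaling, then NaN replacement). The helper name `clip_and_rescale` and the guard for a flat chromosome are additions for illustration and are not part of `momics`.

```python
import numpy as np


def clip_and_rescale(cov, q=99):
    """Cap each array at the genome-wide q-th percentile, then min-max rescale it to [0, 1]."""
    # The cap is computed once, over all chromosomes pooled together
    cap = np.nanpercentile(np.concatenate(list(cov.values())), q)
    out = {}
    for chrom, arr in cov.items():
        arr = np.minimum(arr.astype(float), cap)
        # Min and max are taken per chromosome, as in the loop above
        lo, hi = np.nanmin(arr), np.nanmax(arr)
        span = hi - lo
        # Added guard: a flat chromosome (hi == lo) would otherwise divide by zero
        arr = (arr - lo) / span if span > 0 else np.zeros_like(arr)
        out[chrom] = np.nan_to_num(arr, nan=0.0)
    return out


# Toy coverage: one outlier and one missing value on "I", a flat signal on "II"
toy = {
    "I": np.array([0.0, 1.0, 2.0, np.nan, 100.0]),
    "II": np.array([5.0, 5.0, 5.0]),
}
rescaled = clip_and_rescale(toy)
```

Note that because the minimum and maximum are computed per chromosome, every chromosome spans the full `[0, 1]` range after rescaling, even though the cap itself is genome-wide.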


## Define datasets and model 

We will define a simple convolutional neural network with `tensorflow` to predict the target variable (ATAC-seq coverage) from the feature variable (MNase-seq coverage). This first requires defining the set of genomic coordinates from which to extract data. We will use `mnase_rescaled` coverage scores over tiling genomic windows (`features_size` of `8193` bp, with a stride of `256` bp) as features to predict `atac_rescaled` coverage scores over the same windows, narrowed down to a `target_size` of `512` bp around the center of each window. Chromosome XVI is held out entirely so that it can be used for prediction later. The remaining windows are split into training, validation and testing sets using `momics.utils.split_ranges()`.

In [None]:
import momics.utils as mutils

# Fetch data from the momics repository
features_size = 8192 + 1
stride = 256

bins = repo.bins(width=features_size, stride=stride, cut_last_bin_out=True)
bins = bins.subset(lambda x: x.Chromosome != "XVI")
bins_split, bins_test = mutils.split_ranges(bins, 0.8)
bins_train, bins_val = mutils.split_ranges(bins_split, 0.8)
bins_train


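| | Chromosome | Start | End |
|---|---|---|---|
| 0 | I | 137472 | 145665 |
| 1 | I | 59392 | 67585 |
| 2 | I | 137216 | 145409 |
| 3 | I | 32512 | 40705 |
| 4 | I | 30720 | 38913 |
| ... | ... | ... | ... |
| 27695 | XV | 1025024 | 1033217 |
| 27696 | XV | 773120 | 781313 |
| 27697 | XV | 533248 | 541441 |
| 27698 | XV | 416768 | 424961 |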
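
To make the tiling and the nested split more concrete, here is a minimal NumPy sketch. It assumes a hypothetical 20,000 bp chromosome and a hypothetical helper `tile_windows`; it only illustrates the arithmetic and is not how `repo.bins()` or `split_ranges()` are implemented.

```python
import numpy as np


def tile_windows(chrom_len, width, stride):
    """Return (start, end) pairs of fixed-width windows, dropping any window that would overrun."""
    starts = np.arange(0, chrom_len - width + 1, stride)
    return np.stack([starts, starts + width], axis=1)


features_size, stride = 8192 + 1, 256
windows = tile_windows(20_000, features_size, stride)  # hypothetical chromosome length

# Two successive 80/20 splits give the overall proportions of each set
train_frac = 0.8 * 0.8      # about 64% of all windows
val_frac = 0.8 * (1 - 0.8)  # about 16%
test_frac = 1 - 0.8         # about 20%
```

Window starts are multiples of the stride and each window is exactly `features_size` wide, which matches the coordinates shown in the table above (for example `137472 = 537 * 256` and `145665 - 137472 = 8193`).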


We now need to define separate datasets for training, validation and testing. We will use the `momics.dataset.MomicsDataset()` constructor, specifying the batch size to use during training.

In [None]:
from momics.dataset import MomicsDataset

features = "mnase_rescaled"
target = "atac_rescaled"
target_size = 512
batch_size = 500

train_dataset = (
    MomicsDataset(repo, bins_train, features, target, target_size=target_size, batch_size=batch_size)
    .shuffle(10)
    .prefetch(2)
    .repeat()
)
val_dataset = MomicsDataset(repo, bins_val, features, target, target_size=target_size, batch_size=batch_size)
test_dataset = MomicsDataset(repo, bins_test, features, target, target_size=target_size, batch_size=batch_size)
train_dataset



<_RepeatDataset element_spec=((TensorSpec(shape=(None, 8193, 1), dtype=tf.float32, name=None),), (TensorSpec(shape=(None, 512, 1), dtype=tf.float32, name=None),))>

It is now time to define the model architecture. In this example, we will use a simple, customizable convolutional neural network (`ChromNN`) provided in `momics.nn`. We instantiate the model with the desired number and shape of layers, then compile it with an optimizer, a loss function and metrics.

In [None]:
from momics.nn import ChromNN
from momics.nn import mae_cor
import tensorflow as tf  # type: ignore
from tensorflow.keras import layers  # type: ignore

## Define the model with three convolutional layers
model = ChromNN(
    input=layers.Input(shape=(features_size, 1)),
    output=layers.Dense(target_size, activation="linear"),
    filters=[64, 16, 8],
    kernel_sizes=[3, 8, 80],
).model


## Use a combination of MAE and correlation as loss function
def loss_fn(y_true, y_pred):
    return mae_cor(y_true, y_pred, alpha=0.9)


## Use Adam optimizer, a learning rate of 0.001, and return MAE as metric
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=loss_fn,
    metrics=["mae"],
)
model.summary()
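
The loss above blends MAE with a correlation term through `alpha`. The exact formula used by `momics.nn.mae_cor` is not shown in this notebook, so the NumPy sketch below is only one plausible form. The weighting `alpha * MAE + (1 - alpha) * (1 - r)`, with `r` the Pearson correlation, and the name `mae_cor_sketch` are assumptions made for illustration.

```python
import numpy as np


def mae_cor_sketch(y_true, y_pred, alpha=0.9):
    """Assumed blend of MAE and (1 - Pearson r); not the momics implementation."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    mae = np.mean(np.abs(y_true - y_pred))
    # Pearson correlation computed from centered values
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    denom = np.sqrt(np.sum(yt**2) * np.sum(yp**2))
    cor = np.sum(yt * yp) / denom if denom > 0 else 0.0
    return alpha * mae + (1 - alpha) * (1 - cor)


y = np.array([0.0, 1.0, 2.0, 3.0])
loss_perfect = mae_cor_sketch(y, y)         # identical signals
loss_reversed = mae_cor_sketch(y, y[::-1])  # anti-correlated signals
```

Under this assumed form, a correlation term penalizes predictions that have the right average level but the wrong shape, which MAE alone does not capture well.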


## Fit the model 

Now that we have the datasets and the model, we can fit the model to the training data with its `fit()` method, monitoring performance on the validation dataset after each epoch. Because the training dataset repeats indefinitely, `steps_per_epoch` sets how many batches make up one epoch. Here, we only train for 10 epochs; increasing this number may improve model performance.

In [None]:
import os
import numpy as np
from pathlib import Path
from tensorflow.keras.callbacks import CSVLogger, EarlyStopping, ModelCheckpoint, ReduceLROnPlateau  # type: ignore

os.makedirs(".chromnn", exist_ok=True)
callbacks_list = [
    CSVLogger(Path(".chromnn", "epoch_data.csv")),
    ModelCheckpoint(filepath=Path(".chromnn", "Checkpoint.keras"), monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=40, min_delta=1e-5, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3, min_lr=1e-4),
]
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    callbacks=callbacks_list,
    steps_per_epoch=len(bins_train) // batch_size,
)


Epoch 1/10


Expected: keras_tensor_1
Received: inputs=('Tensor(shape=(None, 8193, 1))',)
I0000 00:00:1743287250.405082 3927603 service.cc:148] XLA service 0x7f88fc021190 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1743287250.405103 3927603 service.cc:156]   StreamExecutor device (0): NVIDIA RTX A2000 12GB, Compute Capability 8.6
2025-03-29 23:27:30.456058: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1743287250.689129 3927603 cuda_dnn.cc:529] Loaded cuDNN version 90300
2025-03-29 23:27:41.060101: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng1{k2=2,k3=0} for conv (f32[1,64,1,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[1,500,1,8193]{3,2,1,0}, f32[64,500,1,8193]{3,2,1,0}), window={size=1x8193 pad=0_0x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"c

[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 348ms/step - loss: 0.6103 - mae: 0.5706

Expected: keras_tensor_1
Received: inputs=('Tensor(shape=(None, 8193, 1))',)


[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 440ms/step - loss: 0.6075 - mae: 0.5676 - val_loss: 0.3204 - val_mae: 0.2526 - learning_rate: 0.0010
Epoch 2/10
[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 213ms/step - loss: 0.3050 - mae: 0.2399 - val_loss: 0.2367 - val_mae: 0.1671 - learning_rate: 0.0010
Epoch 3/10
[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 210ms/step - loss: 0.2712 - mae: 0.2081 - val_loss: 0.2435 - val_mae: 0.1761 - learning_rate: 0.0010
Epoch 4/10
[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 217ms/step - loss: 0.2652 - mae: 0.2046 - val_loss: 0.2286 - val_mae: 0.1616 - learning_rate: 0.0010
Epoch 5/10
[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 220ms/step - loss: 0.2489 - mae: 0.1882 - val_loss: 0.2290 - val_mae: 0.1656 - learning_rate: 0.0010
Epoch 6/10
[1m55/55[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 230ms/step - loss: 0.2303 - mae: 0.1699 - val_loss

<keras.src.callbacks.history.History at 0x7f8b68d669e0>
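
The `55/55` steps shown in the training log follow directly from the size of the training set and the batch size. The short sketch below works through that arithmetic, along with the floor that `min_lr` puts on the learning rate. The numbers are taken from the outputs above, and the variable names are illustrative only.

```python
# bins_train has 27,699 rows (index 0 to 27698 in the table shown earlier)
n_train = 27_699
batch_size = 500

# The training dataset repeats forever, so an epoch is defined as a fixed number of batches
steps_per_epoch = n_train // batch_size
leftover = n_train - steps_per_epoch * batch_size  # windows beyond the last full batch

# With factor=0.1 and min_lr=1e-4, the learning rate can only be reduced once from 1e-3
initial_lr, factor, min_lr = 1e-3, 0.1, 1e-4
lr_after_one_reduction = max(initial_lr * factor, min_lr)
lr_after_two_reductions = max(lr_after_one_reduction * factor, min_lr)
```

Since the dataset is shuffled and repeated, the leftover windows are not lost; they simply spill over into later epochs.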

## Evaluate and save model 

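Now let's see how the trained model performs on the held-out test dataset. The best model (lowest validation loss) was already saved to `.chromnn/Checkpoint.keras` by the `ModelCheckpoint` callback during training.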

In [None]:
# Evaluate the model with our test dataset
model.evaluate(test_dataset)


[1m18/18[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 153ms/step - loss: 0.1961 - mae: 0.1326


[0.19540582597255707, 0.13238614797592163]

## Use the model to predict ATAC-seq coverage

In [None]:
from momics import dataset as mmd
from momics import aggregate as mma

## Predict the ATAC signal from the MNase signal
bb = repo.bins(width=features_size, stride=8, cut_last_bin_out=True)["XVI"]
ds = mmd.MomicsDataset(repo, bb, "mnase_rescaled", batch_size=1000).prefetch(10)
predictions = model.predict(ds)

## Export predictions as a bigwig
centered_bb = bb.copy()
centered_bb.Start = centered_bb.Start + features_size // 2 - target_size // 2
centered_bb.End = centered_bb.Start + target_size
chrom_sizes = repo.chroms(as_dict=True)
keys = [f"{chrom}:{start}-{end}" for chrom, start, end in zip(centered_bb.Chromosome, centered_bb.Start, centered_bb.End)]
track_name = f"atac-from-mnase_f{features_size}_s{stride}_t{target_size}"
res = {track_name: {key: predictions[i] for i, key in enumerate(keys)}}

mma.aggregate(res, centered_bb, chrom_sizes, type="mean", prefix="prediction")


[1m118/118[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 88ms/step


momics :: INFO :: 2025-03-29 23:30:34,206 :: Saved coverage for atac to prediction_atac.bw


{'atac': {'I': array([0., 0., 0., ..., 0., 0., 0.]),
  'II': array([0., 0., 0., ..., 0., 0., 0.]),
  'III': array([0., 0., 0., ..., 0., 0., 0.]),
  'IV': array([0., 0., 0., ..., 0., 0., 0.]),
  'V': array([0., 0., 0., ..., 0., 0., 0.]),
  'VI': array([0., 0., 0., ..., 0., 0., 0.]),
  'VII': array([0., 0., 0., ..., 0., 0., 0.]),
  'VIII': array([0., 0., 0., ..., 0., 0., 0.]),
  'IX': array([0., 0., 0., ..., 0., 0., 0.]),
  'X': array([0., 0., 0., ..., 0., 0., 0.]),
  'XI': array([0., 0., 0., ..., 0., 0., 0.]),
  'XII': array([0., 0., 0., ..., 0., 0., 0.]),
  'XIII': array([0., 0., 0., ..., 0., 0., 0.]),
  'XIV': array([0., 0., 0., ..., 0., 0., 0.]),
  'XV': array([0., 0., 0., ..., 0., 0., 0.]),
  'XVI': array([0., 0., 0., ..., 0., 0., 0.]),
  'Mito': array([0., 0., 0., ..., 0., 0., 0.])}}

This generates a new `bw` file with ATAC-seq coverage over chr16, predicted from MNase-seq coverage.
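
With a stride of 8 bp and a `target_size` of 512 bp, consecutive prediction windows overlap heavily, so each base receives many predictions that have to be combined. The sketch below illustrates the centering arithmetic used in the export code and a simple per-base mean over overlapping windows. The helpers `center_crop` and `average_overlaps` are hypothetical and are not the `momics.aggregate` implementation.

```python
import numpy as np


def center_crop(start, features_size, target_size):
    """Coordinates of the target window centered within a feature window."""
    t_start = start + features_size // 2 - target_size // 2
    return t_start, t_start + target_size


def average_overlaps(starts, preds, chrom_len):
    """Per-base mean of overlapping window predictions; uncovered bases stay at 0."""
    total = np.zeros(chrom_len)
    count = np.zeros(chrom_len)
    for s, p in zip(starts, preds):
        total[s:s + len(p)] += p
        count[s:s + len(p)] += 1
    return np.divide(total, count, out=np.zeros(chrom_len), where=count > 0)


# A feature window starting at 0 maps to a 512 bp target in its middle
crop = center_crop(0, 8192 + 1, 512)

# Toy example: two 4 bp predictions overlapping by 2 bp on a 12 bp chromosome
starts = [2, 4]
preds = [np.full(4, 1.0), np.full(4, 3.0)]
track = average_overlaps(starts, preds, chrom_len=12)
```

In the toy example, the two overlapping positions take the mean of both predictions, while positions not covered by any window remain at zero.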

Here is a screenshot of ATAC-seq coverage track over chr16, from experimental data (darker cyan) or predicted from MNase-seq coverage (MNase: grey track; predicted ATAC: lighter cyan), taken from IGV:

![ATAC-seq coverage track over chr16](images/atac_mnase.png)

A closer look: 

![ATAC-seq coverage track over chr16, zoom](images/atac_mnase2.png)