# Train neural networks with `momics` and `tensorflow`

`momics` provides several utilities for training neural networks with `tensorflow`. This notebook demonstrates how to train a simple convolutional neural network on genomic tracks stored in a `momics` repository.

## Connect to the data repository

We will tap into the repository generated in the [previous tutorial](integrating-multiomics.ipynb). 

In [1]:
from momics.momics import Momics

## Connect to the existing repository
repo = Momics("yeast_CNN_data.momics")

## Check that sequence and some tracks are registered
repo.seq()
repo.tracks()

momics :: INFO :: 2025-03-29 14:03:01,109 :: No cloud config found for momics.Consider populating `~/.momics.ini` file with configuration settings for cloud access.


Unnamed: 0,idx,label,path
0,0,atac,/home/jaseriza/repos/momics/data/S288c_atac.bw
1,1,scc1,/home/jaseriza/repos/momics/data/S288c_scc1.bw
2,2,mnase,/home/jaseriza/repos/momics/data/S288c_mnase.bw
3,3,atac_rescaled,tmpy_sm6gcr
4,4,mnase_rescaled,tmp2w4qf00f


## Modify some tracks

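We first normalize the `atac` and `mnase` tracks: each track is capped at its genome-wide 99th percentile, rescaled to [0, 1] chromosome by chromosome, and saved back to the local repository under a `_rescaled` label.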

In [2]:
import numpy as np

for track in ["atac", "mnase"]:
    cov = repo.tracks(track)
    # Compute genome-wide 99th percentile
    q99 = np.nanpercentile(np.concatenate(list(cov.values())), 99)
    for chrom in cov.keys():
        arr = cov[chrom]
        # Truncate to genome-wide 99th percentile
        arr = np.minimum(arr, q99)
        # Rescale to [0, 1]
        arr = (arr - np.nanmin(arr)) / (np.nanmax(arr) - np.nanmin(arr))
        # Convert NaNs to 0
        arr = np.nan_to_num(arr, nan=0)
        # Store back
        cov[chrom] = arr
    repo.ingest_track(cov, track + "_rescaled")

repo.tracks()

momics :: INFO :: 2025-03-27 08:10:18,821 :: 1 tracks ingested in 1.3552s.
momics :: INFO :: 2025-03-27 08:10:23,770 :: 1 tracks ingested in 1.6064s.


Unnamed: 0,idx,label,path
0,0,atac,/home/jaseriza/repos/momics/data/S288c_atac.bw
1,1,scc1,/home/jaseriza/repos/momics/data/S288c_scc1.bw
2,2,mnase,/home/jaseriza/repos/momics/data/S288c_mnase.bw
3,3,atac_rescaled,tmpy_sm6gcr
4,4,mnase_rescaled,tmp2w4qf00f
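
To make the normalization steps easier to inspect, here is a minimal sketch of the same logic applied to a toy dictionary of arrays instead of a `momics` track. The helper name `rescale_track`, the toy chromosome names and the 90th-percentile cap are illustrative assumptions, not part of `momics`. The sketch also adds a guard for flat chromosomes, which the cell above does not have.

```python
import numpy as np


def rescale_track(cov, q=99.0):
    """Cap values at the genome-wide q-th percentile, then min-max rescale per chromosome."""
    # Genome-wide cap, ignoring NaNs
    cap = np.nanpercentile(np.concatenate(list(cov.values())), q)
    out = {}
    for chrom, arr in cov.items():
        arr = np.minimum(arr, cap)  # NaNs are propagated here
        lo, hi = np.nanmin(arr), np.nanmax(arr)
        # Guard against a flat chromosome (hi == lo), which would divide by zero
        arr = (arr - lo) / (hi - lo) if hi > lo else np.zeros_like(arr)
        # Replace remaining NaNs with 0
        out[chrom] = np.nan_to_num(arr, nan=0.0)
    return out


# Hypothetical toy coverage, with one missing value and one outlier
toy = {
    "chrA": np.array([0.0, 1.0, 2.0, np.nan, 4.0]),
    "chrB": np.array([0.0, 2.0, 1000.0]),
}
rescaled = rescale_track(toy, q=90)
```

Because the minimum and maximum are computed per chromosome, each chromosome ends up spanning the full [0, 1] range, even if its raw signal was weaker than on other chromosomes.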


## Define datasets and model 

We will define a simple convolutional neural network with `tensorflow` to predict the target variable `ATAC` from the feature variable `MNase`. This first requires defining a set of genomic coordinates from which to extract data. We will use `mnase_rescaled` coverage scores over tiling genomic windows (`features_size` of `8193` bp, with a stride of `48` bp) as feature variables to predict `atac_rescaled` coverage scores over the same windows, narrowed down to a `target_size` of `512` bp around the center of each window. Chromosome `XVI` is excluded here so that it can be used for prediction later on. We then split the remaining windows into training, validation and testing sets using `momics.utils.split_ranges()`.

In [None]:
import momics.utils as mutils

# Fetch data from the momics repository
features_size = 8192 + 1
stride = 48

bins = repo.bins(width=features_size, stride=stride, cut_last_bin_out=True)
bins = bins.subset(lambda x: x.Chromosome != "XVI")
bins_split, bins_test = mutils.split_ranges(bins, 0.8)
bins_train, bins_val = mutils.split_ranges(bins_split, 0.8)
bins_train

Unnamed: 0,Chromosome,Start,End
0,I,157728,165921
1,I,221952,230145
2,I,98976,107169
3,I,82320,90513
4,I,213120,221313
...,...,...,...
147706,XV,766512,774705
147707,XV,424752,432945
147708,XV,436464,444657
147709,XV,750192,758385
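
The window geometry described above can be sketched in plain Python. The functions `tile_windows` and `center_target` are hypothetical helpers written for this illustration, and the chromosome length of 100,000 bp is made up. The sketch assumes that `cut_last_bin_out=True` drops any window that would extend past the chromosome end.

```python
def tile_windows(chrom_length: int, width: int, stride: int) -> list[tuple[int, int]]:
    """Return (start, end) windows of fixed width, keeping only windows that fit entirely."""
    return [(s, s + width) for s in range(0, chrom_length - width + 1, stride)]


def center_target(start: int, width: int, target_size: int) -> tuple[int, int]:
    """Return the (start, end) of the target region centered within a feature window."""
    t_start = start + width // 2 - target_size // 2
    return t_start, t_start + target_size


features_size = 8192 + 1
stride = 48
target_size = 512

# Hypothetical 100 kb chromosome
windows = tile_windows(chrom_length=100_000, width=features_size, stride=stride)

# Target region for the first training window listed above (I:157728-165921)
target = center_target(157_728, features_size, target_size)
```

An odd `features_size` (8192 + 1) gives each window a single central base, around which the target region is placed.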


We now need to define separate datasets for training, validation and testing. We will use the `momics.dataset.MomicsDataset()` constructor, indicating the batch size to use during training.

In [None]:
from momics.dataset import MomicsDataset

features = "mnase_rescaled"
target = "atac_rescaled"
target_size = 512
batch_size = 500

train_dataset = (
    MomicsDataset(repo, bins_train, features, target, target_size=target_size, batch_size=batch_size)
    .shuffle(10)
    .prefetch(2)
    .repeat()
)
val_dataset = MomicsDataset(repo, bins_val, features, target, target_size=target_size, batch_size=batch_size).repeat()
test_dataset = MomicsDataset(repo, bins_test, features, target, target_size=target_size, batch_size=batch_size)
train_dataset

<_RepeatDataset element_spec=((TensorSpec(shape=(None, 8193, 1), dtype=tf.float32, name=None),), (TensorSpec(shape=(None, 1024, 1), dtype=tf.float32, name=None),))>

It is now time to define the model architecture. In this example, we will use a simple convolutional neural network (`ChromNN`), pre-defined in `momics.nn`. We can instantiate the model and compile it with the desired optimizer, loss function and metrics.

In [None]:
from momics.nn import ChromNN
from momics.nn import mae_cor
import tensorflow as tf  # type: ignore
from tensorflow.keras import layers  # type: ignore

model = ChromNN(
    input=layers.Input(shape=(features_size, 1)),
    output=layers.Dense(target_size, activation="linear"),
    filters=[64, 16, 8],
    kernel_sizes=[3, 8, 80],
).model


def loss_fn(y_true, y_pred):
    return mae_cor(y_true, y_pred, alpha=0.9)


model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=loss_fn,
    metrics=["mae"],
)
model.summary()

## Fit the model 

Now that we have the datasets and the model, we can fit the model to the training data using its `fit()` method. The validation dataset is used to monitor the loss after each epoch, while the testing dataset is kept aside for the final evaluation.

In [None]:
import numpy as np
from pathlib import Path
from tensorflow.keras.callbacks import CSVLogger, EarlyStopping, ModelCheckpoint, ReduceLROnPlateau  # type: ignore

callbacks_list = [
    CSVLogger(Path(".chromnn", "epoch_data.csv")),
    ModelCheckpoint(filepath=Path(".chromnn", "Checkpoint.keras"), monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=40, min_delta=1e-5, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=6 // 2, min_lr=0.1 * 0.001),
]
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=30,
    callbacks=callbacks_list,
    steps_per_epoch=len(bins_train) // batch_size,
    validation_steps=len(bins_val) // batch_size,
)

Epoch 1/30

Expected: keras_tensor_1
Received: inputs=('Tensor(shape=(None, 8193, 1))',)
I0000 00:00:1743059630.815157 1580176 service.cc:148] XLA service 0x790684008420 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1743059630.815179 1580176 service.cc:156]   StreamExecutor device (0): NVIDIA RTX A2000 12GB, Compute Capability 8.6
2025-03-27 08:13:50.900800: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1743059631.167607 1580176 cuda_dnn.cc:529] Loaded cuDNN version 90300
2025-03-27 08:14:00.775976: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng1{k2=2,k3=0} for conv (f32[1,32,1,5]{3,2,1,0}, u8[0]{0}) custom-call(f32[1,500,1,8193]{3,2,1,0}, f32[32,500,1,8193]{3,2,1,0}), window={size=1x8193 pad=0_0x2_2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_
I0000 00:00:1743059648.119502 1580176 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

Expected: keras_tensor_1
Received: inputs=('Tensor(shape=(None, 8193, 1))',)

295/295 ━━━━━━━━━━━━━━━━━━━━ 86s 227ms/step - loss: 0.3850 - val_loss: 0.0380 - learning_rate: 0.0010
Epoch 2/30
295/295 ━━━━━━━━━━━━━━━━━━━━ 54s 184ms/step - loss: 0.0470 - val_loss: 0.0363 - learning_rate: 0.0010
Epoch 3/30
295/295 ━━━━━━━━━━━━━━━━━━━━ 55s 187ms/step - loss: 0.0339 - val_loss: 0.0403 - learning_rate: 0.0010
Epoch 4/30
295/295 ━━━━━━━━━━━━━━━━━━━━ 56s 191ms/step - loss: 0.0318 - val_loss: 0.0442 - learning_rate: 0.0010
Epoch 5/30
295/295 ━━━━━━━━━━━━━━━━━━━━ 58s 197ms/step - loss: 0.0329 - val_loss: 0.0284 - learning_rate: 0.0010
Epoch 6/30
295/295 ━━━━━━━━━━━━━━━━━━━━ 58s 197ms/step - loss: 0.0298 - val_loss: 0.0266 - learning_rate: 0.0010
Epoch 7/30
295/295 ━━━━━━━━━━━━━━━━━━━━ 58s 197ms/step - loss: 0.0280 - val_loss: 0.0247 - lear

KeyboardInterrupt: 
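
The `steps_per_epoch` value passed to `fit()` follows directly from the size of the training set. The short sketch below reproduces that arithmetic, assuming that `split_ranges(x, 0.8)` returns an 80%/20% split and that the last index shown for `bins_train` (147709) is its final row.

```python
# Fractions of all windows ending up in each set after two successive 80/20 splits
train_frac = 0.8 * 0.8
val_frac = 0.8 * 0.2
test_frac = 0.2

# Number of training windows (assumed from the index range 0..147709 shown earlier)
n_train = 147_710
batch_size = 500

# Each epoch consumes only full batches; the remaining windows are left for the next pass
steps_per_epoch = n_train // batch_size  # 295
leftover = n_train % batch_size
```

Since the training dataset is built with `.repeat()`, it never runs out of batches, so `steps_per_epoch` is what tells Keras where one epoch ends.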

## Evaluate and save model 

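Now let's see how the trained model performs on the test set. During training, the `ModelCheckpoint` callback already saved the best model (lowest `val_loss`) to `.chromnn/Checkpoint.keras`.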

In [9]:
# Evaluate the model
model.evaluate(test_dataset)

93/93 ━━━━━━━━━━━━━━━━━━━━ 10s 111ms/step - loss: 0.0233


2025-03-27 08:23:09.346224: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
2025-03-27 08:23:09.346770: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
	 [[IteratorGetNext/_4]]


0.023295924067497253

## Use the model to predict ATAC-seq coverage

We can now use our trained model to predict ATAC-seq coverage from MNase-seq coverage, for example on a chromosome which has not been used for training.

In [None]:
from momics.query import MomicsQuery
from momics.aggregate import aggregate

## Define tiling windows of `features_size` bp over chromosome XVI, with a stride of 8 bp, and extract MNase data from them.
bb = repo.bins(width=features_size, stride=8, cut_last_bin_out=True)["XVI"]
dat = MomicsQuery(repo, bb).query_tracks(tracks=["mnase_rescaled"])
dat = np.array(list(dat.coverage["mnase_rescaled"].values()))

## Now predict the ATAC signal from the MNase signal
predictions = model.predict(dat)

## Export predictions as a bigwig
bb2 = bb.copy()
bb2.Start = bb2.Start + features_size // 2 - target_size // 2
bb2.End = bb2.Start + target_size
chrom_sizes = {chrom: length for chrom, length in zip(repo.chroms().chrom, repo.chroms().length)}
keys = [f"{chrom}:{start}-{end}" for chrom, start, end in zip(bb2.Chromosome, bb2.Start, bb2.End)]
res = {"atac": {k: None for k in keys}}
for i, key in enumerate(keys):
    res["atac"][key] = predictions[i]

aggregate(res, bb2, chrom_sizes, type="mean", prefix="prediction")

3672/3672 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step


momics :: INFO :: 2025-03-27 08:24:21,442 :: Saved coverage for atac to prediction_atac.bw


{'atac': {'I': array([0., 0., 0., ..., 0., 0., 0.]),
  'II': array([0., 0., 0., ..., 0., 0., 0.]),
  'III': array([0., 0., 0., ..., 0., 0., 0.]),
  'IV': array([0., 0., 0., ..., 0., 0., 0.]),
  'V': array([0., 0., 0., ..., 0., 0., 0.]),
  'VI': array([0., 0., 0., ..., 0., 0., 0.]),
  'VII': array([0., 0., 0., ..., 0., 0., 0.]),
  'VIII': array([0., 0., 0., ..., 0., 0., 0.]),
  'IX': array([0., 0., 0., ..., 0., 0., 0.]),
  'X': array([0., 0., 0., ..., 0., 0., 0.]),
  'XI': array([0., 0., 0., ..., 0., 0., 0.]),
  'XII': array([0., 0., 0., ..., 0., 0., 0.]),
  'XIII': array([0., 0., 0., ..., 0., 0., 0.]),
  'XIV': array([0., 0., 0., ..., 0., 0., 0.]),
  'XV': array([0., 0., 0., ..., 0., 0., 0.]),
  'XVI': array([0., 0., 0., ..., 0., 0., 0.]),
  'Mito': array([0., 0., 0., ..., 0., 0., 0.])}}


This generates a new bigWig file (`prediction_atac.bw`) with ATAC-seq coverage over chr16 (`XVI`), predicted from MNase-seq coverage.
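
With a stride of 8 bp and a `target_size` of 512 bp, consecutive predicted windows overlap heavily, so each base receives many predictions. The sketch below shows one way such overlapping predictions can be averaged into a single per-base track. It assumes that `aggregate(..., type="mean")` behaves along these lines and leaves uncovered positions at 0; the function `mean_of_overlaps` is a hypothetical helper and not the `momics` implementation.

```python
import numpy as np


def mean_of_overlaps(starts, preds, chrom_length):
    """Average per-base predictions from overlapping windows into one track."""
    total = np.zeros(chrom_length)
    count = np.zeros(chrom_length)
    for start, pred in zip(starts, preds):
        total[start:start + len(pred)] += pred
        count[start:start + len(pred)] += 1
    # Positions not covered by any window stay at 0
    track = np.zeros(chrom_length)
    covered = count > 0
    track[covered] = total[covered] / count[covered]
    return track


# Toy example: three 4-bp windows with a stride of 2 bp on a 10-bp chromosome
starts = [0, 2, 4]
preds = [np.full(4, 1.0), np.full(4, 3.0), np.full(4, 5.0)]
track = mean_of_overlaps(starts, preds, chrom_length=10)
```

In the toy example, positions covered by two windows take the mean of both predictions, and the last two positions, covered by no window, remain at 0.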

Here is a screenshot of ATAC-seq coverage track over chr16, from experimental data (darker cyan) or predicted from MNase-seq coverage (MNase: grey track; predicted ATAC: lighter cyan), taken from IGV:

![ATAC-seq coverage track over chr16](images/atac_mnase.png)

A closer look: 

![ATAC-seq coverage track over chr16, zoom](images/atac_mnase2.png)