# Predict genomic coverage from DNA sequence

Because `momics` can ingest reference genome sequence as well as genomic coverage data, we can use it to predict genomic coverage from DNA sequence. This is useful for generating synthetic data or for filling in missing data in a dataset.

## Connect to the data repository

Here again, we will tap into the repository generated in the [previous tutorial](integrating-multiomics.ipynb). 

In [None]:
from momics.momics import Momics

## Connect to the existing repository
repo = Momics("yeast_CNN_data.momics")

## Check that sequence and some tracks are registered
repo.seq()
repo.tracks()


momics :: INFO :: 2025-03-29 23:32:13,019 :: No cloud config found for momics.Consider populating `~/.momics.ini` file with configuration settings for cloud access.


| idx | label | path |
|---|---|---|
| 0 | atac | /home/jaseriza/repos/momics/data/S288c_atac.bw |
| 1 | scc1 | /home/jaseriza/repos/momics/data/S288c_scc1.bw |
| 2 | mnase | /home/jaseriza/repos/momics/data/S288c_mnase.bw |
| 3 | atac_rescaled | tmpbyibsue1 |
| 4 | mnase_rescaled | tmpoqeo1dwd |


## Define datasets and model 

We will define a simple convolutional neural network with `tensorflow` to predict MNase-seq coverage (the target variable, `mnase_rescaled`) from the genome reference sequence (the feature variable, `nucleotide`). This first requires defining a set of genomic coordinates from which to extract data. We will extract sequences over tiling genomic windows (`features_size` of `2049`, with a `stride` of `512`) as feature variables, and predict `mnase_rescaled` coverage scores over the same windows, narrowed down to a `target_size` of `32` bp around the center of each window. Chromosome XVI is excluded from these windows so that it can be used for prediction later. We then split the remaining windows into training, validation and test sets using `momics.utils.split_ranges()`.
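
To make the windowing concrete, here is a minimal pure-Python sketch of how tiling windows and their centered targets can be derived. It does not call `momics`: `tile_windows` and `center_target` are hypothetical helpers, the chromosome length is an illustrative value, and dropping windows that overhang the chromosome end is an assumption about what `cut_last_bin_out=True` does.

```python
def tile_windows(chrom_length, width, stride):
    """Return (start, end) tuples for windows that fit entirely in the chromosome."""
    return [(s, s + width) for s in range(0, chrom_length - width + 1, stride)]


def center_target(start, width, target_size):
    """Return the (start, end) of a target_size-bp interval centered in the window."""
    target_start = start + width // 2 - target_size // 2
    return target_start, target_start + target_size


# Window parameters taken from the tutorial code; the chromosome length is illustrative only.
features_size, stride, target_size = 2048 + 1, 512, 32
windows = tile_windows(10_000, features_size, stride)
targets = [center_target(s, features_size, target_size) for s, _ in windows]

print(len(windows))            # 16
print(windows[0], targets[0])  # (0, 2049) (1008, 1040)
```

The centering arithmetic is the same as the one used later in this tutorial to position predictions before exporting them.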

In [None]:
import momics.utils as mutils
from momics.dataset import MomicsDataset

# Fetch data from the momics repository
features = "nucleotide"
target = "mnase_rescaled"
features_size = 2048 + 1
stride = 512
target_size = 32
batch_size = 500

bins = repo.bins(width=features_size, stride=stride, cut_last_bin_out=True)
bins = bins.subset(lambda x: x.Chromosome != "XVI")
bins_split, bins_test = mutils.split_ranges(bins, 0.8, shuffle=False)
bins_train, bins_val = mutils.split_ranges(bins_split, 0.8, shuffle=False)

train_dataset = (
    MomicsDataset(repo, bins_train, features, target, target_size=target_size, batch_size=batch_size).prefetch(20).repeat()
)
val_dataset = MomicsDataset(repo, bins_val, features, target, target_size=target_size, batch_size=batch_size)
test_dataset = MomicsDataset(repo, bins_test, features, target, target_size=target_size, batch_size=batch_size)
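
The two successive 80/20 splits above leave roughly 64% of the windows for training, 16% for validation and 20% for testing. The sketch below reproduces this arithmetic on a plain list. `split_sequential` is a hypothetical stand-in which assumes that an unshuffled split simply cuts the ordered ranges at the given fraction; `momics.utils.split_ranges()` may round or order ranges differently.

```python
def split_sequential(items, fraction):
    """Cut an ordered list in two, keeping the first `fraction` of items in the first part."""
    cut = int(len(items) * fraction)
    return items[:cut], items[cut:]


bins = list(range(1000))  # stand-in for 1000 ordered genomic windows
bins_split, bins_test = split_sequential(bins, 0.8)
bins_train, bins_val = split_sequential(bins_split, 0.8)

print(len(bins_train), len(bins_val), len(bins_test))  # 640 160 200
```

Under this assumption, each set is a contiguous block of windows rather than a random sample.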



Now it is time to define the model architecture. In this example, we will use a neural network adapted from `Basenji`, pre-defined in `momics.nn`. We instantiate the model, then compile it with the desired optimizer, loss function and metrics.

In [None]:
from momics.nn import ChromNN
from momics.nn import mae_cor
import tensorflow as tf  # type: ignore
from tensorflow.keras import layers  # type: ignore

model = ChromNN(
    input=layers.Input(shape=(features_size, 4)),
    output=layers.Dense(target_size, activation="linear"),
    filters=[64, 16, 8],
    kernel_sizes=[3, 8, 80],
).model


def loss_fn(y_true, y_pred):
    return mae_cor(y_true, y_pred, alpha=0.8)


model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=loss_fn,
    metrics=["mae"],
)
model.summary()
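
The input layer above expects tensors of shape `(features_size, 4)`, which is consistent with one-hot encoded DNA. As an illustration only, such an encoding could look like the sketch below. The `one_hot` helper and the A, C, G, T channel order are assumptions, not necessarily the encoding that `momics` uses internally.

```python
# Assumed channel order; momics may use a different one.
CHANNELS = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as 4-element rows; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(CHANNELS)
        index = CHANNELS.find(base)
        if index >= 0:
            row[index] = 1.0
        rows.append(row)
    return rows


encoded = one_hot("ACGTN")
print(len(encoded), len(encoded[0]))  # 5 4
print(encoded[1])                     # [0.0, 1.0, 0.0, 0.0]
```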


## Fit the model 

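Now that we have the datasets and the model, we can fit the model to the training data using its `fit()` method. The validation dataset is evaluated after each epoch, and the callbacks log per-epoch metrics, checkpoint the best model, stop training early if the validation loss stalls, and reduce the learning rate on plateaus.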

In [4]:
import numpy as np
from pathlib import Path
from tensorflow.keras.callbacks import CSVLogger, EarlyStopping, ModelCheckpoint, ReduceLROnPlateau  # type: ignore

callbacks_list = [
    CSVLogger(Path(".chromnn", "seq2cov.epoch_data.csv")),
    ModelCheckpoint(filepath=Path(".chromnn", "Checkpoint.seq2cov.keras"), monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=40, min_delta=1e-5, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=6 // 2, min_lr=0.1 * 0.001),
]
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=30,
    callbacks=callbacks_list,
    steps_per_epoch=len(bins_train) // batch_size,
)


Epoch 1/30
27/27 ━━━━━━━━━━━━━━━━━━━━ 12s 180ms/step - loss: 1.0077 - mae: 1.0105 - val_loss: 0.4940 - val_mae: 0.3675 - learning_rate: 0.0010
Epoch 2/30
27/27 ━━━━━━━━━━━━━━━━━━━━ 7s 92ms/step - loss: 0.4946 - mae: 0.3724 - val_loss: 0.4299 - val_mae: 0.2902 - learning_rate: 0.0010
Epoch 3/30
27/27 ━━━━━━━━━━━━━━━━━━━━ 5s 179ms/step - loss: 0.3889 - mae: 0.2445 - val_loss: 0.3885 - val_mae: 0.2399 - learning_rate: 0.0010
Epoch 4/30
27/27 ━━━━━━━━━━━━━━━━━━━━ 5s 182ms/step - loss: 0.3699 - mae: 0.2248 - val_loss: 0.3667 - val_mae: 0.2177 - learning_rate: 0.0010
Epoch 5/30
27/27 ━━━━━━━━━━━━━━━━━━━━ 5s 193ms/step - loss: 0.3603 - mae: 0.2173 - val_loss: 0.3561 - val_mae: 0.2087 - learning_rate: 0.0010
Epoch 6/30
27/27 ━━━━━━━━━━━━━━━━━━━━ 5s 174ms/step - loss: 0.3529 - mae: 0.2122 - val_loss: 0.35

<keras.src.callbacks.history.History at 0x729ef7f85db0>
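
To see how the `ReduceLROnPlateau` settings used above interact, the following simplified simulation mimics the plateau logic in plain Python. It is a sketch and not the Keras implementation: it ignores the cooldown option and runs on a hypothetical sequence of validation losses. With a starting rate of `0.001`, `factor=0.1` and `min_lr=0.1 * 0.001`, the learning rate can effectively be reduced only once.

```python
def simulate_plateau(val_losses, lr, factor, patience, min_lr, min_delta=1e-4):
    """Return the learning rate in effect after each epoch (simplified, no cooldown)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            best, wait = loss, 0   # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:   # plateau: shrink the rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses: two improving epochs, then a long plateau.
losses = [0.50, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40]
rates = simulate_plateau(losses, lr=0.001, factor=0.1, patience=6 // 2, min_lr=0.1 * 0.001)
print(rates)
```

In this toy run the rate drops after three epochs without improvement, and the second plateau leaves it unchanged because it has already reached `min_lr`.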

## Evaluate and save model 

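Now let's see how the trained model performs on the test dataset. The best model seen during training has already been saved by the `ModelCheckpoint` callback to `.chromnn/Checkpoint.seq2cov.keras`.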

In [5]:
# Evaluate the model
model.evaluate(test_dataset)


9/9 ━━━━━━━━━━━━━━━━━━━━ 2s 256ms/step - loss: 0.2934 - mae: 0.1684


[0.29288387298583984, 0.16718727350234985]

## Use the model to predict MNase-seq coverage

We can now use our trained model to predict MNase-seq coverage from DNA sequence, here on chromosome XVI, which was not used for training.

In [6]:
from momics import dataset as mmd
from momics import aggregate as mma

## Now predict the MNase signal from the DNA sequence of chromosome XVI
bb = repo.bins(width=features_size, stride=8, cut_last_bin_out=True)["XVI"]
ds = mmd.MomicsDataset(repo, bb, "nucleotide", batch_size=1000).prefetch(10)
predictions = model.predict(ds)

## Export predictions as a bigwig
centered_bb = bb.copy()
centered_bb.Start = centered_bb.Start + features_size // 2 - target_size // 2
centered_bb.End = centered_bb.Start + target_size
chrom_sizes = repo.chroms(as_dict=True)
keys = [f"{chrom}:{start}-{end}" for chrom, start, end in zip(centered_bb.Chromosome, centered_bb.Start, centered_bb.End)]
track_name = f"mnase-from-seq_f{features_size}_s{stride}_t{target_size}"
res = {track_name: dict(zip(keys, predictions))}

mma.aggregate(res, centered_bb, chrom_sizes, type="mean", prefix="prediction")


[1m119/119[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 182ms/step


momics :: INFO :: 2025-03-29 23:35:19,973 :: Saved coverage for mnase-from-seq_f2049_s512_t32 to prediction_mnase-from-seq_f2049_s512_t32.bw


{'mnase-from-seq_f2049_s512_t32': {'I': array([0., 0., 0., ..., 0., 0., 0.]),
  'II': array([0., 0., 0., ..., 0., 0., 0.]),
  'III': array([0., 0., 0., ..., 0., 0., 0.]),
  'IV': array([0., 0., 0., ..., 0., 0., 0.]),
  'V': array([0., 0., 0., ..., 0., 0., 0.]),
  'VI': array([0., 0., 0., ..., 0., 0., 0.]),
  'VII': array([0., 0., 0., ..., 0., 0., 0.]),
  'VIII': array([0., 0., 0., ..., 0., 0., 0.]),
  'IX': array([0., 0., 0., ..., 0., 0., 0.]),
  'X': array([0., 0., 0., ..., 0., 0., 0.]),
  'XI': array([0., 0., 0., ..., 0., 0., 0.]),
  'XII': array([0., 0., 0., ..., 0., 0., 0.]),
  'XIII': array([0., 0., 0., ..., 0., 0., 0.]),
  'XIV': array([0., 0., 0., ..., 0., 0., 0.]),
  'XV': array([0., 0., 0., ..., 0., 0., 0.]),
  'XVI': array([0., 0., 0., ..., 0., 0., 0.]),
  'Mito': array([0., 0., 0., ..., 0., 0., 0.])}}
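
Because the prediction windows are spaced every 8 bp while each prediction spans `target_size` = 32 bp, most positions are covered by several overlapping predictions, which are combined here with `type="mean"`. The sketch below shows one way such a per-base mean can be computed. `mean_aggregate` is a hypothetical helper written for illustration, and `momics.aggregate.aggregate()` may handle uncovered positions or edge cases differently.

```python
def mean_aggregate(intervals, values, chrom_length):
    """Average overlapping per-base predictions into one coverage vector.

    intervals: list of (start, end) pairs; values: one list of scores per interval.
    Positions without any prediction are left at 0.0 (an assumption).
    """
    total = [0.0] * chrom_length
    count = [0] * chrom_length
    for (start, end), scores in zip(intervals, values):
        for offset, score in enumerate(scores):
            total[start + offset] += score
            count[start + offset] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


# Toy example: 4-bp predictions every 2 bp on a 10-bp chromosome.
intervals = [(0, 4), (2, 6), (4, 8)]
values = [[1.0] * 4, [3.0] * 4, [5.0] * 4]
print(mean_aggregate(intervals, values, 10))
# [1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 5.0, 5.0, 0.0, 0.0]
```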