
Graph Convolution based model for de novo identification of nucleotide modifications





Nanopore modification inference

Pre-print: Towards Inferring Nanopore Sequencing Ionic Currents from Nucleotide Chemical Structures


We develop a model that associates chemical information on nanopore kmer models with their mean pA value. We show that the model can learn chemical information in specific contexts, and that this information can be transferred to identify de novo kmer modifications. Our model is implemented with Keras.


Our model runs with python3. We recommend using a recent version of python3 (e.g. python>=3.6).
We recommend using conda to create a virtual environment.
Follow the steps below to install and replicate our results:

conda create -n ndmi_reproduce python=3.6
conda activate ndmi_reproduce
git clone
python install

To run our multi-GPU enabled GridSearch:

We have developed our own multi-GPU-enabled grid search with cross validation.
Each parameter combination can be run on a single GPU, so several combinations can be evaluated in parallel, which accelerates the architecture search.

The following commands accept a kmer model table (-i) and the number of cross validation folds to run (-k).
The optimal parameter set is the one that achieves the lowest average RMSE across all folds.

python -i ./ont_models/r9.4_180mv_70bps_5mer_5to3_RNA.model -k 10
python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -k 10
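
The selection rule described above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the round-robin GPU assignment, and the hyperparameter names (`n_layers`, `learning_rate`) are hypothetical, and the per-fold RMSE values are assumed to come from a training routine that is not shown.

```python
from itertools import product
from statistics import mean


def parameter_grid(grid):
    """Expand a dict of {name: [values]} into a list of parameter dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def assign_to_gpus(combos, n_gpus):
    """Pair each combination with a GPU index (round-robin, an assumption)."""
    return [(i % n_gpus, combo) for i, combo in enumerate(combos)]


def best_by_mean_rmse(results):
    """results: list of (params, [rmse for each fold]).

    Returns the entry whose RMSE averaged over folds is lowest.
    """
    return min(results, key=lambda item: mean(item[1]))


# Hypothetical hyperparameters, for illustration only.
combos = parameter_grid({"n_layers": [2, 4], "learning_rate": [0.01, 0.001]})
for gpu_id, combo in assign_to_gpus(combos, n_gpus=2):
    print(gpu_id, combo)
```

Given one list of fold RMSEs per combination, `best_by_mean_rmse` returns the combination to keep; ties resolve to the first entry encountered.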

Downsample Analysis

We conducted four kinds of downsampling analysis (dropping information from the training data)
to test the limits of our model's performance:

  1. Random kmer downsampling in a 50-fold cross-validation fashion.
  2. Base dropout, where each base is dropped regardless of position on the kmer.
  3. Position-base dropout, where each base is dropped from each of the kmer's positions.
  4. Combination dropout, where a pair of bases is dropped.

In each case, the dropped data appear in the test set.
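
As a rough illustration of schemes 2 and 3, the sketch below splits a kmer set into training and test portions. It assumes that "dropping" a base means moving every kmer that contains it (anywhere, or at one given position) into the test set; the function names are hypothetical and the repository's scripts may define the splits differently.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Enumerate every kmer of length k over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def base_dropout_split(kmers, base):
    """Scheme 2 (assumed): kmers containing `base` anywhere go to the test set."""
    train = [kmer for kmer in kmers if base not in kmer]
    test = [kmer for kmer in kmers if base in kmer]
    return train, test


def position_base_dropout_split(kmers, base, position):
    """Scheme 3 (assumed): kmers with `base` at `position` go to the test set."""
    train = [kmer for kmer in kmers if kmer[position] != base]
    test = [kmer for kmer in kmers if kmer[position] == base]
    return train, test


kmers = all_kmers(6)
train, test = base_dropout_split(kmers, "A")
print(len(train), len(test))  # 729 3367

train, test = position_base_dropout_split(kmers, "A", position=0)
print(len(train), len(test))  # 3072 1024
```

Under this reading, base dropout is the harsher test: for DNA 6-mers only 3^6 = 729 kmers remain for training, while position-base dropout still leaves three quarters of the table.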

The script runs 1 & 3:

usage: [-h] [-i FILE] [-cv] [-k FOLDS] [-o OUT] [-v VERBOSITY]
                  [-kmer_cv] [-test_splits SPLITS [SPLITS ...]]

Script takes in a kmer and pA measurement file. The user can select between
random cross validation, or targeted cross validation, where each based is
hidden from each position of the kmer in training. Script saves cross
validation results as a .npy file

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --FILE FILE  kmer file with pA measurement
  -cv, --CV             MODE: Random CV splits of variable size
                        K for fold numbers in cross validation
  -o OUT, --OUT OUT     Full path for .npy file where results are saved
                        Verbosity of model. Other than zero, loss per batch
                        per epoch is printed. Default is 0, meaning nothing is
  -kmer_cv, --KMERCV    MODE: Position-based dropout of each base
  -test_splits SPLITS [SPLITS ...], --SPLITS SPLITS [SPLITS ...]
                        Test splits to run k-fold cross validation over

For example, to run the kmer downsample analysis on DNA:

python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -cv -o dna_downsample_results.npy

and the positional dropout analysis:

python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -kmer_cv -o dna_posdrop_results.npy

The script runs 2 & 4:

usage: [-h] [-i FILE] [-base_pair_exclude] [-base_exclude]
                        [-o OUT] -n_type NTYPE

Run base-pair specific dropout cross validation

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --FILE FILE  kmer file with pA measurement
  -base_pair_exclude, --PAIRS
                        MODE: pairs of pases will be excluded
  -base_exclude, --SOLO
                        MODE: each of the four bases will be removed from
  -o OUT, --OUT OUT     Full path for .npy file where results are saved
  -n_type NTYPE, --NTYPE NTYPE
                        Type of nucleotide examined: DNA or RNA

For example, to run the base dropout analysis on DNA:

python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -o dna_exclude_base_results.npy -n_type DNA -base_exclude

and the base-pair dropout analysis:

python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -o dna_exclude_basepairs_results.npy -n_type DNA -base_pair_exclude

Modification prediction analysis

To train the model on canonical kmers only and predict on all possible M-modified kmers (M = methylated C), run the following line:

python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -model_fn dna_model -o dna_mod_pred_50repeat_results.npy

To train the model on canonical kmers plus a fraction of the M-containing kmers, and then predict all possible M-modified kmers, run the following line:

python -i ./ont_models/r9.4_180mv_450bps_6mer_DNA.model -o dna_mod_trainpred_results_50fold.npy
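
The two training regimes above can be pictured with a small sketch. It assumes the alphabet is A, C, G, T plus M, treats any kmer containing at least one M as modified, and samples the training fraction uniformly at random; the helper names and the sampling scheme are hypothetical and may not match the repository's scripts.

```python
import random
from itertools import product


def split_canonical_modified(k, canonical="ACGT", modified="M"):
    """Enumerate all kmers and separate those that contain a modified base."""
    canonical_kmers, modified_kmers = [], []
    for letters in product(canonical + modified, repeat=k):
        kmer = "".join(letters)
        if any(base in modified for base in kmer):
            modified_kmers.append(kmer)
        else:
            canonical_kmers.append(kmer)
    return canonical_kmers, modified_kmers


def training_set_with_fraction(canonical_kmers, modified_kmers, fraction, seed=0):
    """Canonical kmers plus a random fraction of modified kmers (assumed scheme)."""
    rng = random.Random(seed)
    n_modified = int(fraction * len(modified_kmers))
    return canonical_kmers + rng.sample(modified_kmers, n_modified)


canonical_kmers, modified_kmers = split_canonical_modified(6)
print(len(canonical_kmers), len(modified_kmers))  # 4096 11529

train = training_set_with_fraction(canonical_kmers, modified_kmers, fraction=0.1)
print(len(train))  # 5248
```

With `fraction=0` this reduces to the first regime, where the model sees only the 4096 canonical 6-mers and must predict all 11529 M-containing ones.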

Reproducing paper results

The manuscript's results can be reproduced at once by simply running the following code:
