This tool estimates the substitution rate between two sequences (e.g., two genomes). The method is robust to highly repetitive input sequences.
We consider the following random substitution process, parameterized by a rate
Then, given the
Linux (64 bit)
C++17
git clone git@github.com:medvedevgroup/Mutation_rate_estimator.git
cd ./src
make
The compiled executable will be located in ./Mutation_rate_estimator/. You can verify successful installation with:
./Mutation_rate_estimator -h
We provide example data to help you get started. Run the following commands to test the tool:
./Mutation_rate_estimator \
--mode sequence \
--input1 ./example_data/origin_seq.fasta \
--input2 ./example_data/mutated_seq.fasta \
--k 30
Note that all k-mers in ./example_data/ are 30-mers.
./Mutation_rate_estimator \
--mode mixture \
--input1 ./example_data/origin_seq.fasta \
--input2 ./example_data/mutated_kmers.fasta \
--k 30
./Mutation_rate_estimator \
--mode kmer \
--input1 ./example_data/origin_kmers.fasta \
--input2 ./example_data/mutated_kmers.fasta \
--dist ./example_data/dist.csv \
--k 30
Provide two complete sequences in separate FASTA files:
./Mutation_rate_estimator \
--mode sequence \
--input1 seq1.fasta \
--input2 seq2.fasta \
--k 31
User provides a complete sequence and a set of
./Mutation_rate_estimator \
--mode mixture \
--input1 seq1.fasta \
--input2 set2.fasta \
--k 31
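If you only have a full sequence and want to try mixture mode, you need a k-mer set for the second input. The following Python sketch shows one way to derive such a set. It is an illustration under assumptions, not part of the tool: the helper names (`kmer_set`, `format_kmers_fasta`) are hypothetical, the file layout (one k-mer per FASTA record) is assumed rather than documented here, and k-mers are not canonicalized with their reverse complements. Compare the output against ./example_data/mutated_kmers.fasta before relying on it.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers of seq that contain only A, C, G, T."""
    seq = seq.upper()
    valid = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= valid:
            kmers.add(kmer)
    return kmers


def format_kmers_fasta(kmers):
    """Format k-mers as FASTA text, one record per k-mer (assumed layout)."""
    lines = []
    for idx, kmer in enumerate(sorted(kmers)):
        lines.append(f">kmer_{idx}")
        lines.append(kmer)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequence; a real run would read the sequence from a FASTA file.
    print(format_kmers_fasta(kmer_set("ACGTACGT", 4)), end="")
```

Sorting the k-mers only makes the output deterministic; the order of records should not matter for a set.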
User provides two sets of
occurrence i, number of kmers with occurrence i
Then run the tool with a command of the following form:
./Mutation_rate_estimator \
--mode kmer \
--input1 set1.fasta \
--input2 set2.fasta \
--dist occ.csv \
--k 31
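The file passed to --dist lists, for each occurrence count i, the number of k-mers with that count. A minimal Python sketch of how such a histogram could be computed from a sequence is shown below. The function names are hypothetical, and the exact layout (comma separator, no header row, k-mers counted without reverse complements) is an assumption; check ./example_data/dist.csv for the format the tool actually expects.

```python
from collections import Counter


def occurrence_histogram(seq, k):
    """Return sorted (occurrence, number of distinct k-mers) pairs for seq."""
    seq = seq.upper()
    # Count how many times each k-mer occurs in the sequence.
    kmer_counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Count how many distinct k-mers share each occurrence value.
    hist = Counter(kmer_counts.values())
    return sorted(hist.items())


def format_dist_csv(hist):
    """Format the histogram as 'occurrence,count' rows (assumed layout)."""
    return "".join(f"{occ},{n}\n" for occ, n in hist)


if __name__ == "__main__":
    # Toy sequence: the 2-mer AA occurs three times, AC occurs once.
    print(format_dist_csv(occurrence_histogram("AAAAC", 2)), end="")
```

For the toy input, one 2-mer occurs once and one occurs three times, so the sketch emits one row for occurrence 1 and one row for occurrence 3.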
User can use parameter --theta to speed up the calculation of intersection size of two
The default value of theta is --theta should be provided a value in the range of
--e: Absolute error tolerance for Newton’s method, default as
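To illustrate what an absolute error tolerance controls in Newton's method, here is a small Python sketch. It does not reproduce the tool's estimator. The equation solved, (1 - r)^k = p, is the standard relation between a substitution rate r and the probability p that a k-mer contains no substitution when positions are substituted independently; the stopping rule (stop once the update step is smaller than the tolerance) and all function names are assumptions made for this example.

```python
def newton_solve(f, df, x0, tol=1e-10, max_iter=100):
    """Find a root of f with Newton's method, stopping when |step| < tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        # Absolute tolerance: stop once successive iterates are within tol.
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")


def rate_from_survival(p, k, tol=1e-10):
    """Solve (1 - r)^k = p for r, starting from r = 0 (illustrative only)."""
    def f(r):
        return (1.0 - r) ** k - p

    def df(r):
        return -k * (1.0 - r) ** (k - 1)

    return newton_solve(f, df, 0.0, tol)


if __name__ == "__main__":
    k, p = 30, 0.5
    print(rate_from_survival(p, k))   # Newton's method
    print(1.0 - p ** (1.0 / k))       # closed form, for comparison
```

A smaller tolerance costs more iterations but gives a more precise root; for this stand-in equation the closed form makes the result easy to check.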
If you use our tool, please cite:
@article {Wu2025.06.19.660607,
author = {Wu, Haonan and Blanca, Antonio and Medvedev, Paul},
title = {A k-mer-based estimator of the substitution rate between repetitive sequences},
elocation-id = {2025.06.19.660607},
year = {2025},
doi = {10.1101/2025.06.19.660607},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/06/25/2025.06.19.660607},
eprint = {https://www.biorxiv.org/content/early/2025/06/25/2025.06.19.660607.full.pdf},
journal = {bioRxiv}
}

Or
@InProceedings{wu_et_al:LIPIcs.WABI.2025.20,
author = {Wu, Haonan and Blanca, Antonio and Medvedev, Paul},
title = {{A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences}},
booktitle = {25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
pages = {20:1--20:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-386-7},
ISSN = {1868-8969},
year = {2025},
volume = {344},
editor = {Brejov\'{a}, Bro\v{n}a and Patro, Rob},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.20},
URN = {urn:nbn:de:0030-drops-239465},
doi = {10.4230/LIPIcs.WABI.2025.20},
annote = {Keywords: k-mers, sketching, mutation rates}
}