The code was inspired by the preprint of
A. Blanca et al. (2021)
The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches.
Journal of Computational Biology 29:155-168
DOI: 10.1089/cmb.2021.0431
which used Stein's method to develop CLTs for the k-mer sketch of sequences. Under reasonable assumptions, the k-mer sketch can determine sequence length exactly because the it lists all k-mers in the sequence. In contrast, submers sample the k-mers in a sequence. Thus, the submer count does not determine the sequence length exactly, but yields probabilistic bounds on it through a CLT. Direct use of Stein's method fails for submer sketches because usually they do not determine the sequence length exactly. In principle, however, Stein's method provides an approximate heuristic bound through estimation of the sequence length. In practice, the "estimate method" fails for many parameter values, because the sparsity of submers causes the bound from Stein's method to be too generous. Accordingly, our code relies on a "qualitative method", leaving simulations to estimate the accuracy of the confidence intervals from the CLT convergence.
A. Dutta et al. (2022)
Parameterized syncmer schemes improve long-read mapping
bioRxiv: 2022.01.10.475696
influenced us to generalize our methods for open and closed syncmers to their "parametrized syncmers". The main practical change in the code is in the parameters for syncmers. For example,
python theta-from-closed-syncmer-count.py -k 10 -s 3 -n 100 -u 90 -c 0.95
python theta-from-open-syncmer-count.py -k 10 -s 3 -t 1 -n 100 -u 90 -c 0.95
are now called as
python theta-from-parametrized-syncmer-count.py -k 10 -s 3 -t 0 7 -n 100 -u 90 -c 0.95
python theta-from-parametrized-syncmer-count.py -k 10 -s 3 -t 1 -n 100 -u 90 -c 0.95
where closed (k,s)=(10,3) syncmers are now parametrized (k, s, ts)=(10, 3, '0 7') syncmers,
and open (k,s,t)=(10,3,1) syncmers are now parametrized (k, s, ts)=(10, 3, '1') syncmers.
The closed syncmers indicate by '0 7' that our ts are offset from array index 0 within the k-mer.
The open syncmer (k, s, ts)=(10, 3, '1') parametrizes with its offset at array index 1 (after the start) of 10-mer.
Parametrized syncmers now include an option -e for downsampling, e.g., ' -e 0.1' indicates a downsampling probability eps=0.1.
Omit the option to default to eps=0.0, the corresponding values with no syncmer downsampling.
Our code estimates sequence length and mutation probability per base by using submer counts in central limit theorems. Presently, the submers can be of 3 types: parametrized syncmers, minimizers, or minimally overlapping k-mers. Note, however, the context-dependency of minimizers obstructed the estimation of the corresponding mutation probabilities. The code also calculates the first-occurrence probabilities (the inter-submer distance distribution) for each submer type. A general formula converts the first-occurrence probabilities to the alpha-run probabilities of
J. Shaw & Y.W. Yu (2021)
Theory of local k-mer selection with applications to long-read alignment
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab790/6432031
programs/ contains Python executables and ...make.py drivers, which display calls to the programs.
Output/ contains the output from the ...make.py drivers, so diff can verify the executables by comparing their output files ...log.
modules/ contains Python modules and classes, tested by a main() to demonstrate calls to the module subroutines.
programs/ Executables have an -h (help) option to explain their arguments.
- distance-distribution...py outputs first-occurrence probabilities. If the '-y' flag is set, it outputs alpha-test probabilities.
- length-from...py inputs the submer count of a sequence and outputs a confidence interval for its length.
- theta-from...py inputs the submer counts of a reference sequence and a mutated version and outputs a confidence interval for the mutation probability per letter.
modules/ contains the files performing the computations. A brief summary of the most important of these files follow.
-
jls_submer_clt_mgr.py (manager file for submer CLTs) is the main programming interface. Please refer to its comments on the arguments and the return of its subroutines for more information on the CLTs.
-
For all types of submers, let the indicator Yi = 1 if the i-th k-mer is a submer, and 0 otherwise. The CLTs depend on the the autocovariance function cov[Y0,Yi]. The base class file submer.py calculates the autocovariance from the expected products E[Y0Yi] provided by the derived class files for each submer type (jls_syncmer_parametrized.py, jls_minimizer.py, and jls_non_overlapping_pattern_prob.py).
Some miscellaneous topics
- The α-test probabilities of Shaw & Yu
The derived class files (jls_syncmer_parametrized.py, jls_syncmer_minimizer.py, and jls_non_overlapping_pattern_prob.py) calculate first-occurrence probabilities (inter-submer distance distribution)
fi = Pr{ Yi = 1 and Yj = 0 (0 < j < i) | Y0 = 1 }.
In contrast, the expected products E[Y0Yi] leave Yj for 0 < j < i unrestricted.
submer.py interconverts the α-test probabilities Pr(f,α) of Shaw & Yu (2021) and first-passage probabilities fi, which determine each other according to relatively simple formulas. Shaw & Yu (2021) give general four-variable recursions for the α-test probabilities, and Dutta et al (2022) other recursions, but the recursions given here are much faster.
- Window probabilities for (w,k)-minimizers
Minimizers have a window guarantee: if two sequences share a (w+k-1)-mer, then they share a (w,k)-minimizer. Subsequences of length α+k-1 only have a window probability π(α), where π(α) = 1 for α ≥ w; and π(α) < 1 otherwise. jls_minimizer_common_submer.py calculates π(α).