RecoverY is a tool for shortlisting enriched reads from a sequencing dataset, based on k-mer abundance. Specifically, it can be used for isolating Y-specific reads from a Y flow-sorted dataset.
Before running RecoverY, the following input files are required in ./data folder. Note that currently the names of the "data" folder and of the files are hardcoded into RecoverY.
r1.fastq : Enriched raw reads (first in pair) r2.fastq : Enriched raw reads (second in pair) kmers_from_reads : kmer counts for r1.fastq trusted_kmers : kmers occuring in one copy on the Y chromosome.
A typical run of RecoverY looks like this.
python recoverY.py --read_length 250 --kmer_size 31 --Ymer_match_threshold 50 --threads 8
RecoverY accepts the following parameters.
python recoverY.py [--help] [--read_length READ_LENGTH] [--kmer_size KMER_SIZE] [--Ymer_match_threshold YMER_MATCH_THRESHOLD] [--abundance_threshold ABUNDANCE_THRESHOLD] [--threads THREADS] [--plots]
--kmer_size (default: 25): kmer size used for classifying reads. This must be the same as DSK's k-mer-size. We recommend a value between 25 and 31 for Illumina 150x150 bp reads. This value is used for calculating Ymer_match_threshold below
--read_length (default:150) : The length of longest un-trimmed read input to RecoverY. This value is used for calculating Ymer_match_threshold below
--Ymer_match_threshold : the number of k-mers a read must match to the Ymer table in order to be classified as coming from the Y. The default value is calculated by the formula : 0.4 * (read_len - kmer_size + 1 - (2*kmer_size*read_len/100)). User may change this, but we recommend a value between 20 and 50 for Illumina 150x150 bp reads.
--abundance_threshold : the abundance threshold beyond which all k-mers are assumed to be Ymers. Specify this only if user does not have access to a trusted k-mers file and would like to manually set an abundance threshold.
--threads (default: 2): The number of threads that RecoverY should use.
--help: print usage information.
--plots: generates a k-mer abundance plot in
output/kmerplot.png(requires Matplotlib and Seaborn python packages to be installed, see below). This plot visualizes the abundance threshold selected by RecoverY, by plotting the abundance of raw read k-mers as well as trusted k-mers. See here for an example.
The output of RecoverY are two files:
This is a subset of the read pairs in
data/r2.fastq that are deemed to have originated on the Y chromosome.
git clone https://github.com/makovalab-psu/RecoverY
RecoverY also requires the numpy and biopython python packages in order to run.
Additionally, if you use the
--plot option, RecoverY needs the matplotlib and seaborn packages for python.
However, RecoverY can be run without the matplotlib or seaborn packages, as long as the
--plot option is not used.
These packages can be installed on many systems as follows:
pip install numpy pip install biopython pip install matplotlib pip install seaborn
RecoverY also uses the k-mer counter DSK. The latest DSK binaries (v2.2.0 for Linux 64 bit and v2.2.0 for Mac OSX) are provided in the dependency folder. Thus, if you are using either of these operating systems, DSK need not be installed, and you may use the binaries as provided. For other operating systems, or if alternate versions or functionality of DSK is desired, see https://gatb.inria.fr/software/dsk/.
To test if RecoverY works properly, the
data folder comes with an example dataset.
The example read data is a random subsample of 100,000 reads from a larger simulated data set in which reads from hg38 autosomes are sampled at 6x depth, X chromosome at 3x depth, and Y chromosome at 300x depth (see Section 3.1 in the paper).
To test the example, you must first untar it and then run RecoverY, as follows:
cd data/ tar xf kmers_from_reads.tar.xz cd ../ python recoverY.py
The reads in this example contain their chromosome of origin as part of their FASTQ header. We can therefore check the number of correctly retrieved Y-reads by RecoverY:
grep "@chrY" data/r1.fastq | wc -l grep "@chrY" output/op_r1.fastq | wc -l
The result should be
This indicates that RecoverY correctly identified most of the Y chromosome reads from the sample.
trusted_kmers file contains those k-mers that occur exactly once
in the X-degenerate gene sequences from the human Y chromosome. It can be used without any modifications.
Alternatively, the user can generate their own
trusted_kmers file, for example, by using the X-degenerate gene sequences
from a different species, or using other single-copy regions of a Y chromosome.
The file can be generated by running DSK k-mer counter on the relevant sequences and extracting only k-mers with a count of 1.
trusted_kmers file should contain one kmer per line.
Extra information after the kmer is allowed but ignored.
kmers_from_reads is a file which includes a line for every distinct kmer in
each line contains the kmer sequence followed by whitespace followed by the number of times it occurs in
For example, the first three lines could look like this.
AAAAAAAAAAAAAAAAGAAAAACAA 5 AAAAAAAAAAAAAAACAAGCTGAAT 8 AAAAAAAAAAAAAAAGAAAAACAAA 3
kmers_from_reads file can be generated by the k-mer counter DSK. The ./dependency folder contains DSK binaries for Linux 64 bit and Mac OSX. Usage is as follows (example shown below is for a Linux system) :
cd dependency ./run_dsk_Linux.sh <FASTQ_file> <kmer_size>
If the k-mer counts file for raw reads (r1.fastq) is not already provided, the user may need to generate k-mer counts manually using DSK. To generate k-mer counts with DSK, the following steps are needed :
cd dependency ln -s ../data/r1.fastq # make sure the correct reads file is provided to DSK ./run_dsk_Linux.sh r1.fastq 25
The kmer_counts table will be generated in :
This file can be copied or linked to the data folder so that RecoverY can use it :
cd ../data ln -s ../dependency/kmers_from_reads
RecoverY ignored any k-mer in
kmers_from_reads that has a count less than 3.
If using DSK to generate the file, k-mers with counts less than 3 are discarded by default,
thereby saving space (i.e. --abundance-min is set to 3 by default in DSK) .
The following scripts are included with this distribution of RecoverY, and are automatically run by recovery.py as part of the pipeline. Users may consider them separately for custom needs if required.
a set of general purpose functions to work with kmers
input : trusted_kmers and reads_from_kmers output : Ymer_table with new abundance threshold
input : all raw reads (first in pair) and Ymer table output : Y-specific reads according to RecoverY algorithm (first in pair)
input : all raw reads (second in pair) and Y-specific reads accoding to RecoverY algorithm (first in pair) output : Y-specific reads according to RecoverY algorithm (second in pair)
This program is released under the MIT License. Please see LICENSE.md for details
If you use RecoverY in your research, please cite
Samarth Rangavittal, Robert S. Harris, Monika Cechova, Marta Tomaszkiewicz, Rayan Chikhi, Kateryna Makova, Paul Medvedev
Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx771.