-
Notifications
You must be signed in to change notification settings - Fork 7
Converting to msbwt's RLE format
The Run-Length Encoded (RLE) format is primarily for reducing the disk space required to store the MSBWT. Additionally, it has the side effect of reducing computation as well, especially in high coverage genomic datasets. Some of our construction algorithms automatically build the BWT directly into this compressed format, typically by using the '-c' option from the command line.
However, there are many other tools for constructing the MSBWT that are not a part of this package. We provide access to a "convert" script that will take as input a MSBWT string and save it into the RLE format we use. For example, the MSBWT of the string "ACAT$" is "T$CAA", so the following command will convert that string to the RLE format we use and store it on disk:
echo -e "T\$CAA" | msbwt convert /path/to/output/msbwt
We note that this tool does not perform any sanity checks on the input given to it, instead just performing the compression and conversion. Giving our conversion tool a MSBWT that does not follow our definition of a MSBWT may have unexpected consequences in downstream queries. The following section briefly describes our MSBWT specification.
- The alphabet of our strings in lexicographic order is [$, A, C, G, N, T], corresponding to the order of symbols in ASCII. Deviations from this ordering may lead to incorrect results from all queries.
- Each string in the string collection has a single '$' symbol (the end-of-string character). Deviations from this specification may lead to errors in string recovery functions.
- All strings are treated as cyclic strings for the purpose of determining order in a suffix array (and thus the MSBWT). Deviations from this ordering may lead to incorrect results from all queries.
- The order of '$' symbols in the MSBWT corresponds to the order in a suffix array. As a result, the '$' can be naively achieved by sorting all strings in the collection. Deviations from this ordering may result in errors in string recovery functions.
We now discuss an example of how to meet these specifications using another MSBWT creation tool called ropebwt2. The tool and research paper can be accessed at the following links:
- Tool - https://github.com/lh3/ropebwt2
- Paper - Fast construction of FM-index for long sequence reads.
The tool has two main differences from the msbwt package. First, its alphabet is ordered [$, A, C, G, T, N]. Second, it allows for different orderings of the '$' symbols as opposed to a single sort-based ordering. Fortunately, both of these differences can be worked around with some basic piping.
gunzip -c reads.fq.gz | awk 'NR % 4 == 2' | sort | gzip > reads.sorted.txt.gz
gunzip -c reads.sorted.txt.gz | tr NT TN | ropebwt2 -LR | tr NT TN | msbwt convert /path/to/output/msbwt
The first line reads a FASTQ file, pulls out just the sequences for each read, sorts them, and then recompresses the output. The intermediate file is compressed and contains all sequences in order with one sequence per line. The second command swaps the 'N' and 'T' symbols, uses ropebwt2 for the MSBWT calculation, swaps the 'N' and 'T' symbols back, and then passes that string to our msbwt convert function. If an intermediate file is unnecessary, then these two lines can be combined into one command:
gunzip -c reads.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | msbwt convert /path/to/output/msbwt
Holt, J., & McMillan, L. (2014). Merging of multi-string BWTs with applications. Bioinformatics, btu584.
Holt, J., & McMillan, L. (2014, September). Constructing burrows-wheeler transforms of large string collections via merging. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 464-471). ACM.
James Holt - holtjma@cs.unc.edu