paper/paper.tex

\documentclass[smallabstract,smallcaptions]{dccpaper}

\usepackage[utf8]{inputenc}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{textgreek}
\usepackage{url}

\newlength{\figurewidth}
\newlength{\smallfigurewidth}
\setlength{\smallfigurewidth}{2.75in}
\setlength{\figurewidth}{6in}


% Mathematics
\newcommand{\set}[1]{\ensuremath{\{ #1 \}}}
\newcommand{\abs}[1]{\ensuremath{\lvert #1 \rvert}}
\newcommand{\Oh}{\ensuremath{\mathsf{O}}}
\newcommand{\oh}{\ensuremath{\mathsf{o}}}
\newcommand{\Th}{\ensuremath{\mathsf{\Theta}}}

% Structures
\newcommand{\SA}{\textsf{SA}}
\newcommand{\ISA}{\textsf{ISA}}
\newcommand{\BWT}{\textsf{BWT}}
\newcommand{\FMI}{\textsf{FMI}}
\newcommand{\C}{\textsf{C}}
\newcommand{\RA}{\textsf{RA}}
\newcommand{\mSA}{\ensuremath{\mathsf{SA}}}
\newcommand{\mISA}{\ensuremath{\mathsf{ISA}}}
\newcommand{\mBWT}{\ensuremath{\mathsf{BWT}}}
\newcommand{\mC}{\ensuremath{\mathsf{C}}}
\newcommand{\mRA}{\ensuremath{\mathsf{RA}}}

% Operations
\newcommand{\LF}{\textsf{LF}}
\newcommand{\rank}{\textsf{rank}}
\newcommand{\select}{\textsf{select}}
\newcommand{\mLF}{\ensuremath{\mathsf{LF}}}
\newcommand{\mrank}{\ensuremath{\mathsf{rank}}}
\newcommand{\mselect}{\ensuremath{\mathsf{select}}}

% Shorthands
\newcommand{\Acoll}{\ensuremath{\mathcal{A}}}
\newcommand{\Bcoll}{\ensuremath{\mathcal{B}}}

% Names
\newcommand{\BWTmerge}{\textsf{BWT\nobreakdash-merge}}
\newcommand{\ropebwt}{\textsf{RopeBWT}}
\newcommand{\ropebwtii}{\textsf{RopeBWT2}}
\newcommand{\CEU}{\textsf{CEU}}
\newcommand{\RS}{\textsf{RS}}


\begin{document}

\title
{\large
\textbf{Burrows-Wheeler transform for terabases}
}


\author{%
Jouni Sirén\thanks{This work was supported by the Wellcome Trust grant [098051].} \\[0.5em]
{\small\begin{minipage}{\linewidth}\begin{center}
\begin{tabular}{c}
Wellcome Trust Sanger Institute \\
Wellcome Trust Genome Campus \\
Hinxton, Cambridge, CB10 1SA, UK \\
\url{jouni.siren@iki.fi}
\end{tabular}
\end{center}\end{minipage}}
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
In order to avoid the reference bias introduced by mapping reads to a reference genome, bioinformaticians are investigating reference-free methods for analyzing sequenced genomes. With large projects sequencing thousands of individuals, this raises the need for tools capable of processing terabases of sequence data. A key method is the Burrows-Wheeler transform (BWT), which is widely used for compressing and indexing reads. We propose a practical algorithm for building the BWT of a large read collection by merging the BWTs of smaller subcollections. The algorithm is capable of merging over 700 Gbp/day on a single system, with 20--25 gigabytes of memory overhead on top of the run-length encoded BWTs.
\end{abstract}


\Section{Introduction}

The dramatic decrease in the cost of DNA sequencing has flooded the world with \emph{sequence data}. The \emph{1000 Genomes Project} \cite{1000GP2015} sequenced the genomes of over 2500 humans, and there are many other projects that are similar or greater in scale. For each sequenced genome, a sequencing machine produces a large number of \emph{reads} (short sequences) that cover the genome many times over. For a 3~Gbp human genome, the total length of the reads is often 100~Gbp or more.

\emph{De novo assembly} of sequenced genomes is still too difficult to be routinely done. As a practical alternative, bioinformaticians typically align the reads are aligned to a \emph{reference genome} of the same species. Because most reference genomes have been derived from the genomes of a small number of individuals, this introduces \emph{reference bias}, which may adversely affect the results of subsequent analysis. Switching from reference sequences to \emph{reference graphs} can help to reduce the bias, but such transition will likely take years \cite{Church2015}.

The \emph{ReadServer project} at the Wellcome Trust Sanger Institute aims to avoid the reference bias entirely by developing tools for large-scale \emph{reference-free} genome analysis. Unique reads are compressed and indexed using the \emph{Burrows-Wheeler transform} (\BWT) \cite{Burrows1994}, while metadata databases contain information about the original reads. In the pilot phase, the ReadServer project works with the low-coverage and exome data from phase 3 of the 1000 Genomes Project. After error correction and trimming the reads to either 73~bp or 100~bp, the 86~Tbp of raw data consisting of 922~billion original reads is reduced to 4.88~Tbp and 53.0~billion unique sequences. These sequences are stored in 16 \BWT-based indexes \cite{Ferragina2005a} taking a total of 561.5~gigabytes.

Preprocessing large datasets can take days or even weeks. As a result, it is not always feasible to rebuild everything when new methods of analysis require new functionalities from the indexes. \BWT-based methods were chosen due to their versatility. A \emph{run-length encoded} \BWT{} compresses repetitive sequence collections quite well \cite{Maekinen2010}, while the similarities to the suffix tree and the suffix array make \BWT-based indexes suitable for a wide variety of \emph{pattern matching} and \emph{sequence analysis} tasks \cite{Ohlebusch2013,Maekinen2015}.

ReadServer stores the unique reads in 16 \BWT-based indexes according to the last two bases of the read. This means that every query must be repeated 16 times, while the indexes require more space, as the compression cannot take advantage of the similarities between the reads in different indexes. Reducing the number of indexes would improve both index size and performance. However, the lack of \BWT{} construction algorithms that work with terabases of sequence data has made it infeasible so far.

There are four often contradictory requirements for large-scale \BWT{} construction:
\begin{itemize}

\item \textbf{Speed.} Larger datasets require faster algorithms. As a rough guideline, an algorithm processing 1~Mbp/s is good for up to 100~Gbp, while remaining somewhat useful until 1~Tbp of data.

\item \textbf{Memory.} We may have to process $n$~bp datasets on systems with less than $n$~bits of memory.

\item \textbf{Hardware.} A single node in a computer cluster typically has tens of CPU cores, hundreds of gigabytes of memory, a limited amount of local disk space, and access to a large amount of shared disk space with no performance guarantees. Algorithms using a GPU or a large amount of fast disk space require special-purpose hardware.

\item \textbf{Efficiency.} Large \BWT{}s can be built by doing a lot of redundant work on multiple nodes. As most computer clusters do not have large amounts of unused capacity, such inefficient algorithms are unsuitable for repeated use.

\end{itemize}

The most straightforward approach to \BWT{} construction is to build a \emph{suffix array} for the sequences using a fast general-purpose algorithm \cite{Mori2008,Nong2011}, and then transform the suffix array into the \BWT. These algorithms cannot be used with large datasets, as they require several times more memory than the size of the dataset. Suffix arrays can be built on disk \cite{Gonnet1992}, but even the fastest algorithms cannot index more than 1\nobreakdash--2~Mbp/s \cite{Bingmann2013,Kaerkkaeinen2014a,Nong2014,Kaerkkaeinen2015a}.

There are several algorithms for constructing the \BWT{} \emph{directly} without having to build the suffix array first. They typically require only a limited amount of working space on top of a plain or compressed representation of the \BWT{} \cite{Hon2007,Kaerkkaeinen2007,Siren2009,Okanohara2009} or use the disk as additional working space \cite{Ferragina2012,Beller2013}. General-purpose implementations of these algorithms rarely exceed 1\nobreakdash--2~Mbp/s with large datasets.

\emph{Specialized algorithms} for DNA sequences are much faster. Some algorithms achieve 5\nobreakdash--10~Mbps on general-purpose hardware, with their memory usage becoming a major bottleneck for datasets larger than about 1~Tbp \cite{Bauer2013,Li2014a}. GPU-based algorithms can be much faster, but their memory usage is also greater \cite{Liu2014,Pantaleoni2014}. Distributing the \BWT{} construction on multiple nodes can remove the obvious bottlenecks, but such algorithms generally do not use the hardware as efficiently as those running on  a single system \cite{Wang2015}.

In this paper, we propose a practical algorithm for building the \BWT{} for terabases of sequence data. The algorithm is based on splitting the sequence collection into a number of smaller subcollections, building the \BWT{} for each subcollection, and \emph{merging} the \BWT{}s into a single structure \cite{Siren2009}. The merging algorithm is faster than \BWT{} construction for the subcollections, while having a relatively small memory overhead on top of the final \BWT-based index. As the index must be loaded in memory before it can be used, index construction can be done on the same system as the index is used.


\Section{Background}

A \emph{string} $S[1,n] = s_{1} \dotsm s_{n}$ is a sequence of \emph{characters} over an \emph{alphabet} $\Sigma = \set{1, \dotsc, \sigma}$. For indexing purposes, we consider \emph{text} strings $T[1,n]$ terminated by an endmarker $T[n] = \$ = 0$ not occurring anywhere else in the text. \emph{Binary} sequences are strings over the alphabet $\set{0, 1}$. A \emph{substring} of string $S$ is a sequence of the form $S[i,j] = s_{i} \dotsm s_{j}$. Substrings of the type $S[1,j]$ and $S[i,n]$ are called \emph{prefixes} and \emph{suffixes}, respectively.

The \emph{suffix array} (\SA) \cite{Manber1993} is a simple full-text index of a string. Given a text $T$, its suffix array $\mSA_{T}[1,n]$ is an array of pointers to the suffixes of the text in \emph{lexicographic order}. (If the text is evident from the context, we will omit it and write just \SA.) We can build the suffix array in $\Oh(n)$ time using approximately $n(\log n + \log \sigma + 2)$ bits of space, or just $2n$ bits of working space over the text and the suffix array \cite{Nong2011}. Given a \emph{pattern} $P$, we can find the \emph{lexicographic range} $[sp,ep]$ of suffixes of the text prefixed by the pattern in $\Oh(\abs{P} \log n)$ time by binary searching in the suffix array. The corresponding range of pointers $\mSA[sp,ep]$ lists the \emph{occurrences} of the pattern in the text.

While the suffix array is simple, it requires several times more memory than the original text. For large texts, this can be a serious drawback. The more space-efficient alternatives to the suffix array are based on the \emph{Burrows-Wheeler transform} (\BWT) \cite{Burrows1994}, an easily reversible permutation of the text with a similar combinatorial structure to the suffix array. Given a text $T[1,n]$ and its suffix array, we can easily produce the \BWT{} as $\mBWT[i] = T[\mSA[i]-1]$ (with $\mBWT[i] = T[n]$, if $\mSA[i] = 1$).

A key observation is that if suffix $X$ precedes suffix $Y$ in lexicographic order, suffix $cX$ will also precede suffix $cY$, for any character $c$. Therefore, if $\mBWT[i]$ is the $j$\nobreakdash-th occurrence of character $c$ in the \BWT, and $\mSA[i]$ points to suffix $X$, suffix $cX$ is the $j$\nobreakdash-th suffix starting with $c$ in lexicographic order. This forms the essence of \LF\emph{\nobreakdash-mapping}. The suffix preceding $T[\mSA[i],n]$ in text order is the one starting at $\mSA[\mLF(i)]$, where
$$
\mLF(i) = \mC[\mBWT[i]] + \mBWT.\mrank(i, \mBWT[i]),
$$
$\mC[c]$ is the number of suffixes starting with a character smaller than $c$, and $S.\mrank(i,c)$ is the number of occurrences of character $c$ in the prefix $S[1,i]$. For searching, we use a more general definition of \LF\nobreakdash-mapping:
$$
\mLF(i,c) = \mC[c] + \mBWT.\mrank(i, c),
$$
which tells the number of suffixes $X$ of text $T$ with $X \le c T[\mSA[i],n]$ in lexicographic order. This is known as the \emph{lexicographic rank} $\mrank(cT[\mSA[i],n], T)$ of string $cT[\mSA[i],n]$ among the suffixes of text $T$.

The \emph{FM-index} (\FMI) \cite{Ferragina2005a} uses the Burrows-Wheeler transform as a full-text index by augmenting it with a few additional structures. The lexicographic range $[sp,ep]$ corresponding to pattern $P$ is found through \emph{backward searching}. Let $[sp_{i},ep_{i}]$ be the lexicographic range of suffixes of text $T$ matching suffix $P[i, \abs{P}]$ of the pattern. We can find $[sp_{i-1},ep_{i-1}]$ as
$$
[sp_{i-1},ep_{i-1}] = [\mLF(sp_{i}-1, P[i-1]) + 1, \mLF(ep_{i}, P[i-1])].
$$
By starting from $[sp_{\abs{P}}, ep_{\abs{P}}] = [\mC[P[\abs{P}]]+1, \mC[P[\abs{P}]+1]]$, we can find the lexicographic range of suffixes starting with the pattern in $\Oh(\abs{P} \cdot t_{r})$ time, where $t_{r}$ is the time required to answer \rank{} queries on the \BWT. In practice, the time complexity ranges from $\Oh(\abs{P})$ to $\Oh(\abs{P} \log n)$, depending on the encoding.

In order to find the occurrences of the pattern in the text, the FM-index \emph{samples} some suffix array pointers, and uses \LF\nobreakdash-mapping to derive the unsampled pointers. The set of sampled pointers always includes a pointer to the beginning of each text. If $\mSA[i]$ is not sampled, the FM-index proceeds to $\mLF(i)$ and continues from there. If $\mSA[\mLF^{k}(i)]$ is the first sample encountered, $\mSA[i] = \mSA[\mLF^{k}(i)] + k$. Depending on the way the samples are selected, we may need a binary sequence to mark the pointers that have been sampled.

Assume that we have an ordered \emph{collection} of texts $\Acoll = (T_{1}, \dotsc, T_{m})$ of total length $n = \abs{\Acoll} = \sum_{i} \abs{T_{i}}$. We want to build a (generalized) \BWT{} for the collection. The usual way is to assume that the endmarkers of all text are distinct, with the one at the end of text $T_{i}$ having character value $(0,i)$. This guarantees that each suffix has a unique lexicographic rank among the suffixes of the collection. We still encode each endmarker as $0$ in the \BWT{} in order to save space. Because of this, \LF\nobreakdash-mapping does not work with $c = 0$, and we cannot match patterns spanning text boundaries.

With large collections of short texts (such as short reads), there are more space-efficient alternatives to sampling. Because all endmarkers had distinct character values during sorting, we know that $\mSA[i]$ (with $i \le m)$ points to the endmarker of text $T_{i}$. To reach the endmarker, we iterate
$$
\Psi(i) = \mBWT.\mselect(i - \mC[c], c),
$$
where $c$ is the largest character value with $\mC[c] < i$ and $S.\mselect(i,c)$ finds the $i$\nobreakdash-th occurrence of character $c$ in string $S$. If $k \ge 0$ is the smallest value for which $j = \Psi^{k}(i) \le m$, we know that $\mSA[i]$ points to offset $\abs{T_{j}} - k$ in text $T_{j}$.

We can \emph{extract} text $T_{i}$ in $\Oh(\abs{T_{i}} \cdot t_{r})$ time by using \LF\nobreakdash-mapping \cite{Burrows1994}. We start from the endmarker at $\mBWT[i]$ and extract the text backwards as
$$
T_{i}[\abs{T_{i}} - j] = \mBWT[\LF^{j-1}(i)] \,\, \textrm{for} \,\, 1 \le j < \abs{T_{i}}.
$$
As $\mSA[\LF^{j}(i)]$ points to suffix $T_{i}[\abs{T_{i}}-j, \abs{T_{i}}]$, we also extract the lexicographic ranks of all suffixes of text $T_{i}$ in the process.


\Section{Space-efficient \BWT{} construction}

The FM-index was introduced as a more space-efficient alternative to the suffix array. If we have to build a suffix array first in order to construct an FM-index, we lose a large part of the benefits, and index construction becomes the overall bottleneck. To overcome this bottleneck, we can use \emph{incremental construction algorithms} that build the FM-index directly. Some of them use an adjustable amount of working space on top of the FM-index, making it possible to index text colletions that are too large to fit in memory in an uncompressed form.

Assume that we have built the \BWT{} of text $T$, and we want to \emph{transform} the \BWT{} into that of text $cT$, where $c$ is a character. The transformation \cite{Hon2007} is quite straightforward. We first find the pointer $\mSA[i]$ to the beginning of text $T$ (the position $i$, where $\mBWT[i] = 0$). Then we determine the lexicographic rank $j = \mrank(cT, T) = \mC[c] + \mBWT.\mrank(i, c)$ of text $cT$ among the suffixes of text $T$. Finally we transform the \BWT{} by \emph{replacing} the endmarker at $\mBWT[i]$ with the inserted character $c$, and by \emph{inserting} a new endmarker between $\mBWT[j]$ and $\mBWT[j+1]$.

There are six basic ways of using the transformation in a \BWT{} construction algorithm. Instead of inserting a single character at a time, we may use \emph{batch updates} and transform the \BWT{} of text $T$ directly into that of text $XT$, where $X$ is any string \cite{Hon2007}. We can start with the \BWT{}s of text collections $\Acoll$ and $\Bcoll$, and \emph{merge} them into the \BWT{} of collection $\Acoll \cup \Bcoll$ \cite{Siren2009}. We can also have the \BWT{} of a text collection, and \emph{extend} multiple texts by inserting a new character to the beginning of each of them in a single update \cite{Bauer2013}. In all three cases, we may use either \emph{static} or \emph{dynamic} \cite{Chan2007} structures for the \BWT.


\Section{Merging \BWT{}s of text collections}

% FIXME: a running example with 2 texts
Assume that we have two text collections $\Acoll$ and $\Bcoll$ of total length $n_{\Acoll}$ and $n_{\Bcoll}$, respectively, and that we have already built \BWT{}s for them. Assume that we store the \BWT{}s in a two-level array, where the first level contains pointers to \emph{blocks} of size $b$ (bits). If the \BWT{} takes $x$ bits, the two-level array imposes $\frac{x}{b} \log x + \Oh(b)$ bits of space overhead. With $b = \sqrt{x \log x}$, the overhead becomes $\Oh(\sqrt{x \log x})$ bits.

The algorithm for merging $\mBWT_{\Acoll}$ and $\mBWT_{\Bcoll}$ into $\mBWT_{\Acoll \cup \Bcoll}$ has three phases \cite{Siren2009}:
\begin{enumerate}

\item \textbf{Search.} We search for all texts in collection $\Bcoll$ using $\mBWT_{\Acoll}$ and output the lexicographic rank $\mrank(X, \Acoll)$ for each suffix $X$ of collection $\Bcoll$. This takes $\Oh(n_{\Bcoll} t_{r})$ time. If we do not have collection $\Bcoll$ in plain form, we can extract the texts from $\mBWT_{\Bcoll}$ in the same asymptotic time.

\item \textbf{Sort.} We sort the ranks produced in the search phase to build the \emph{rank array} (\RA) of collection $\Bcoll$ relative to collection $\Acoll$. The rank array is defined as $\mRA_{\Bcoll \mid \Acoll}[i] = \mrank(X, \Acoll)$, where $X$ is the suffix pointed by $\mSA_{\Bcoll}[i]$. The array requires $n_{\Bcoll} \log n_{\Acoll}$ bits of space, and we can build it in $\Oh(sort(n_{\Bcoll}, n_{\Acoll}))$ time, where $sort(n, u)$ is the time required to sort $n$ integers from universe $[0,u]$. If we extract the texts from $\mBWT_{\Bcoll}$ in the search phase, we can write the ranks directly into the rank array, making this phase unnecessary.

\item \textbf{Merge.} We interleave $\mBWT_{\Acoll}$ and $\mBWT_{\Bcoll}$ according to the rank array. If $\mRA_{\Bcoll \mid \Acoll}[i] = j$, the merged \BWT{} will have $j$ characters from $\mBWT_{\Acoll}$ before $\mBWT_{\Bcoll}[i]$. This phase takes $\Oh(n_{\Acoll} + n_{\Bcoll})$ time. By reusing the blocks of $\mBWT_{\Acoll}$ and $\mBWT_{\Bcoll}$ for $\mBWT_{\Acoll \cup \Bcoll}$ once we have processed them, we can merge the \BWT{}s almost in-place, using $\Oh(\sqrt{x \log x})$ bits of working space, where $x$ is the maximum of the sizes of $\mBWT_{\Acoll}$ and $\mBWT_{\Bcoll}$ in bits.

\end{enumerate}
The overall time complexity of \BWT{} merging is $\Oh(n_{\Acoll} + n_{\Bcoll} t_{r})$. The algorithm uses $n_{\Bcoll} \log n_{\Acoll} + \Oh(\sqrt{x \log x})$ bits of working space in addition to the \BWT{}s and the structures required to use them as FM-indexes.

Alternatively, we can build the \emph{interleaving bitvector} $B_{\Acoll \cup \Bcoll}$. This bitvector is a binary sequence of length $n_{\Acoll} + n_{\Bcoll}$ that encodes the same information as the rank array. We set $B_{\Acoll \cup \Bcoll}[i + \mRA_{\Bcoll \mid \Acoll}[i]] = 1$ for $1 \le i \le n_{\Bcoll}$, denoting that $\mSA_{\Acoll \cup \Bcoll}[i + \mRA_{\Bcoll \mid \Acoll}[i]]$ points to a suffix of collection $\Bcoll$. This approach uses $n_{\Acoll} + n_{\Bcoll} + \Oh(\sqrt{x \log x})$ bits of working space.


\Section{Large-scale \BWT{} merging}

If we split text collection $\Acoll$ of total length $n$ into $p$ \emph{subcollections} of equal size, we can build $\mBWT_{\Acoll}$ by merging the \BWT{}s of the subcollections incrementally. Merging the subcollections takes $\Oh((p+t_{r})n)$ time and uses (essentially) $\min(n, \frac{n}{p} \log n)$ bits of working space.

When the collection is large, the space overhead of the construction algorithm often determines whether we can build the \BWT{} of the collection. Even if a static encoding of the \BWT{} fits in memory, a dynamic encoding may already be too large. The space overhead from the rank array or the interleaving bitvector (or their equivalents in the other space-efficient algorithms) may also be too much. We can make the rank array fit in memory by increasing the number of subcollections, but that can make the construction too slow.

The key to reducing the space overhead is that we do not need to keep the rank array in memory. We can write the lexicographic ranks to \emph{disk} in the search phase, sort them to build the rank array in the sort phase, and scan the rank array once during the merge phase. In practice, we may want to \emph{interleave} the sort phase with the search/merge phases and store the lexicographic ranks on disk in a \emph{compressed form}, because the amount of fast disk space can also become a bottleneck. In the following, we describe the key ideas that make large-scale \BWT{} merging fast and space-efficient.

\smallbreak\noindent\textbf{Search.} Instead of searching for every text in collection $\Bcoll$ separately, we can search for the \emph{reverse trie} of the collection. Assume that there are $m_{\Acoll}$ texts in collection $\Acoll$ and $m_{\Bcoll}$ texts in collection $\Bcoll$. The \emph{root} of the trie corresponds to suffix $\$$, which has lexicographic rank $m_{\Acoll}$ in $\Acoll$ and corresponds to lexicographic range $[1,m_{\Bcoll}]$ in $\Bcoll$.

% FIXME pseudocode if we have space for it
Assume that we have a \emph{node} of the trie corresponding to suffix $X$, lexicographic rank $r$, and lexicographic range $[sp,ep]$. This means that suffix $X$ occurs $sp+1-ep$ times in collection $\Bcoll$, so we can output a \emph{run} of lexicographic ranks $(r, sp+1-ep)$. Afterwards, we proceed to the \emph{children} of the node. For each character $c \in \Sigma$, we create a node corresponding to suffix $cX$, lexicographic rank $\mLF_{\Acoll}(r,c)$, and lexicographic range $[\mLF_{\Bcoll}(sp-1, c) + 1, \mLF_{\Bcoll}(ep, c)]$. Searching the branches of the trie can be done in parallel using multiple \emph{threads}.

\smallbreak\noindent\textbf{Buffering.} To reduce disk usage and the amount of disk I/O, we buffer and compress the lexicographic ranks before writing them to disk. Each thread has two buffers: a \emph{run buffer} and a \emph{thread buffer}. The run buffer stores the runs as pairs of integers $(r, \ell)$. Once the run buffer becomes full, we sort the runs by \emph{run heads} $r$, use \emph{differential encoding} for the run heads, and encode the differences and run lengths with a \emph{prefix-free code}. The compressed run buffer is then merged with the similarly compressed thread buffer.

Once a thread buffer becomes full, we insert it into the global \emph{merge buffers}. There are $k$ merge buffers, with buffer $i$ containing $2^{i-1}$ merged thread buffers. The insertion starts at buffer $1$. If buffer $i$ is empty, the thread inserts its thread buffer there and returns back to the search phase. Otherwise it takes the contents of merge buffer $i$, merges them with its thread buffer, and proceeds to buffer $i+1$. If a thread reaches buffer $k+1$, it writes its thread buffer to disk and returns back to work.

\smallbreak\noindent\textbf{Merge.} The merge phase starts with the rank array stored in multiple files on disk. Each of the files is sorted and compressed. For interleaving the \BWT{}s, we need to merge the files to produce an increasing run-length encoded stream of lexicographic ranks. We can also use multiple threads. One thread reads the files and performs a \emph{multiway merge} using a priority queue, producing a stream of lexicographic ranks. Another thread consumes the stream and uses it to interleave the \BWT{}s. Because the producer thread is usually slower than the consumer thread, we may want to use multiple threads for the multiway merge, if the disk is fast enough.


\Section{Implementation}

We have implemented the improved \BWT{} merging algorithm as a tool for merging the \BWT{}s of large collections of short reads. The tool, \BWTmerge{}, is written in C++, and the source code is available at GitHub.\footnote{\url{https://github.com/jltsiren/bwt-merge}} The implementation is built on top of the \emph{SDSL library} \cite{Gog2014b} and uses the features of C++11 extensively. As a result, it needs a fairly recent C++11 compiler to compile. We have successfully built \BWTmerge{} on Linux and OS~X using g++.

The target environment of \BWTmerge{} is a \emph{single node} of a \emph{computer cluster}. The system should have tens of CPU cores and hundreds of gigabytes of memory. The amount of local disk space might not be much larger than memory size, while there can be plenty of shared disk space available. The number of search threads is equal to the number of CPU cores, while the merge phase uses just one producer thread and one consumer thread. By adjusting the sizes of run buffers and thread buffers and the number of merge buffers, \BWTmerge{} should work reasonably well in different environments.

The internal alphabet of \BWTmerge{} is \texttt{012345}, which corresponds to either \texttt{\$ACGTN} or \texttt{\$ACGNT}, depending on where the \BWT{}s come from. \BWT{}s using different alphabetic orders cannot be merged. We use simple byte-level codes for \emph{run-length encoding} the \BWT{}s. The encoding of run $(c, \ell)$, where $c$ is the character value and $\ell$ is the length, depends on the length of the run. If $\ell \le 41$, the run is encoded in a single byte as $6 \cdot (\ell-1) + c$. Longer runs start with byte $6 \cdot 41 + c$, followed by the encoding of $\ell-42$. The remaining run length is encoded as a sequence of bytes, with the low 7 bits containing data and the high bit telling whether the encoding continues in the next byte. The compressed buffers use the same 7+1\nobreakdash-bit code for both the differentially encoded run heads and the run lengths.

For \rank/\select{} support, we split the \BWT{}s into 64\nobreakdash-byte blocks of compressed data, so that no run continues from one block to the next one. For each block $i$, we store the total number of characters in blocks $1$ to $i-1$ as $n_{i}$, as well as the cumulative character counts $c_{i} = \mBWT.\mrank(n_{i},c)$ for $0 \le c \le 5$. Each of these seven increasing sequences is encoded using the \emph{sdarray} encoding \cite{Okanohara2007}. To compute $\mBWT.\mrank(j,c)$, we first use a \rank{} query on the $n_{i}$ sequence to find the correct block. A \select{} query on the same sequence transforms position $j$ into an offset within the block, while a \select{} query on the $c_{i}$ sequence gives the rank up to the beginning of the block. Finally, we decompress the block to answer the query. \select{} queries and random accesses to the \BWT{} work in a similar way. There are also optimizations for e.g.~computing $\mrank(i,c)$ for all characters $c$, and for finding the children of a reverse trie node corresponding to a relatively short lexicographic range.

We use two-level arrays with 8\nobreakdash-megabyte blocks to store the \BWT{}s and the compressed buffers, managing the blocks using \texttt{mmap()} and \texttt{munmap()}. In our experiments, this direct memory management reduced the space overhead by tens of gigabytes. When multiple threads allocate memory in small enough blocks, the multithreaded \emph{glibc} implementation of \texttt{malloc()} creates a number of additional heaps that will never grow larger than 64~MB. Each thread tries to reuse the heap it used for the last allocation. If a heap is full or the thread cannot acquire the mutex, it moves to the next heap. With our workload of tens of threads allocating and freeing hundreds of gigabytes of memory, this created thousands of heaps with holes in the middle. Most calls to \texttt{free()} could not release the memory back to the operating system, while it took a while before a thread reached the heap where the \texttt{free()} took place.


\Section{Experiments}

We ran our experiments on a system with two 16\nobreakdash-core AMD Opteron 6378 processors and 256 gigabytes of memory. The system was running Ubuntu 12.04 on Linux kernel 3.2.0. All software was compiled with gcc/g++ version 4.9.2. We stored the input/output files on a distributed Lustre file system and used a local 0.5~TB hard disk for temporary files. \BWTmerge{} was allowed to use 32 threads, while the other \BWT{} construction tools were limited to 4 or 5 parallel threads by design. All reported merging times include verification by querying the \BWT{}s with 2~million $32$\nobreakdash-mers extracted from the human reference genome.

\begin{table}[t!]
\begin{center}
\caption{Datasets used in the experiments.}\label{table:datasets}\smallskip%
{
\renewcommand{\baselinestretch}{1}\footnotesize
\begin{tabular}{cc|ccc}
\hline
\textbf{Dataset} & \textbf{File} & \textbf{Size} & \textbf{Reads} & \textbf{Read length} \\
\hline
\CEU & NA12878 &  284 Gbp & 2.81G &       101 bp \\
     & NA12891 &  242 Gbp & 2.40G &       101 bp \\
     & NA12892 &  245 Gbp & 2.42G &       101 bp \\
     & Merged  &  771 Gbp & 7.63G &       101 bp \\
\hline
\RS  & AA      &  433 Gbp & 4.69G & 73 or 100 bp \\
     & TT      &  432 Gbp & 4.68G & 73 or 100 bp \\
     & AT      &  275 Gbp & 2.98G & 73 or 100 bp \\
     & TA      &  355 Gbp & 3.84G & 73 or 100 bp \\
     & Merged  & 1.49 Tbp & 16.2G & 73 or 100 bp \\
\hline
\RS  & *A, *C  & 2.45 Tbp & 26.5G & 73 or 100 bp \\
     & *G, *T  & 2.44 Tbp & 26.5G & 73 or 100 bp \\
\hline
\end{tabular}}
\end{center}
\end{table}

We used two datasets from the phase 3 of the \emph{1000 Genomes Project} \cite{1000GP2015}:
\begin{itemize}

\item \CEU{} contains reads from high-coverage sequencing of the \emph{CEU trio} (individuals NA12878, NA12891, and NA12892). We downloaded the gzipped FASTQ files (run accessions SRR622457, SRR622458, and SRR622459) and concatenated the files representing the same individual. Then we corrected the sequencing errors with BFC \cite{Li2015} (\texttt{bfc -s 3g -t 16}) separately for each individual.

\item \RS{} is from the \emph{ReadServer project}, which uses all low-coverage and exome data from the phase 3. After error correction, trimming the reads to 73 or 100~bp, and merging the duplicates, there are 53.0~billion unique reads for a total of 4.88~Tbp. The reads are in 16 run-length encoded \BWT{}s built using the \emph{String Graph Assembler} \cite{Simpson2012}, distributed according to the last two bases.

\end{itemize}
See Table~\ref{table:datasets} for further details on the datasets. We used a development version of \BWTmerge{} that was essentially equivalent to v0.3 for the experiments. For the other tools, we used the versions that were available on GitHub in October~2015.

\smallbreak\noindent\textbf{Benchmarking.} For benchmarking with different parameter values, we converted four \BWT{} files (AA, TT, AT, and TA) containing a total of 1.49~Tbp from the \RS{} dataset to the \emph{native format} of \BWTmerge. This format includes the \BWT{} and the \rank/\select{} structures required by the FM-index. We then merged the \BWT{}s (in the given order). We used 128~MB or 256~MB run buffers and 256~MB or 512~MB thread buffers. The number of merge buffers was 4 or 5 with 512~MB thread buffers and 5 or 6 with 256~MB thread buffers, so that the files on disk were always merged from either 8~GB or 16~GB of thread buffers.

\begin{figure}[t!]
\begin{center}
\includegraphics{benchmark.pdf}%
\hspace{-0.3in}%
\includegraphics{comparison.pdf}
\end{center}
\caption{Left: Time/space trade-offs for merging four \BWT{} files containing a total of 1.49~Tbp. Label rXbYmZ denotes $X$~MB run buffers, $Y$~MB threads buffers, and $Z$ merge buffers. The dashed line represents the total size of the last pair of \BWT{}s to be merged. Right: Time/space trade-offs for building the \BWT{} of the 771~Gbp \CEU{} dataset.}\label{fig:benchmark}
\end{figure}

The results of the benchmarks can be seen in Figure~\ref{fig:benchmark} (left). Merging took 31.4 to 35.7~hours and required 145 to 166~gigabytes of memory, depending on the parameter values. This corresponds to an average speed of 8.27 to 9.40~Mbp/s for inserting 1.06~Tbp into the initial file AA. As the total size of the last pair of \BWT{}s was 124.2~gigabytes, the memory overhead was 21.1 to 41.5~gigabytes. The temporary files required 287 to 306~gigabytes of disk space. For the further experiments, we chose 128\nobreakdash-megabyte run buffers and 256\nobreakdash-megabyte thread buffers. We use 5 merge buffers in the \emph{small} variant (memory overhead 21.1~GB) and 6 merge buffers in the \emph{fast} variant (overhead 30.8~GB).

\smallbreak\noindent\textbf{Comparison.} In the next experiment, we compared the performance of \BWTmerge{} to the \ropebwt{} \cite{Li2011-2013} implementation of the BCR algorithm \cite{Bauer2013} and to \ropebwtii{} \cite{Li2014a}. These two tools are the fastest existing \BWT{} construction tools for read collections on general hardware \cite{Li2014a}. We ran \ropebwt{} with parameters \texttt{-btORf -abcr} and \ropebwtii{} with parameters \texttt{-bRm10g}. Due to memory constraints, we used the smaller \CEU{} dataset.

We built the \BWT{} of the dataset in four ways. We concatenated all input files, and built the \BWT{} directly for the entire collection using both \ropebwt{} and \ropebwtii{}. We built the \BWT{} for each input file separately using \ropebwt{}, and then merged the results using \BWTmerge{} with small settings. Finally, we built the individual \BWT{}s using parallel \ropebwt{} jobs, and then merged them using \BWTmerge{} with fast settings. In each case, the tools were allowed to output the \BWT{} files in their preferred formats.

The results can be seen in Figure~\ref{fig:benchmark} (right).
% FIXME discussion of the results; note that ropebwt2 did not finish

\smallbreak\noindent\textbf{ReadServer.} In the final experiment, we merged the 16 \BWT{} files in the \RS{} dataset into two files using the fast settings.
% FIXME sizes in native/sga format, merging time/space


\Section{Discussion}

conclusions: fast, low memory overhead. because BWTs are in-memory data structures, they can be built on the same system they are going to be used

sequence ordering: original order may be important, RLO/RCLO for compression, RLCO for mapping between reverse complements, mapping position for encoding pairing information; ReadServer currently uses RLO

remove duplicates


\Section{References}
\bibliographystyle{IEEEtran}
\bibliography{paper}


\end{document}