Verification that the format exists.
Jouni Siren committed Oct 26, 2015
1 parent 281fe40 commit a8f9d74
Showing 6 changed files with 53 additions and 11 deletions.
11 changes: 10 additions & 1 deletion README.md
@@ -112,7 +112,12 @@ There are also other algorithms for building the BWT for large read collections
* `BlockArray` uses now 8 MB blocks instead of 1 MB blocks, changing the native file format.
* More space-efficient rank/select construction for the BWT.
* Formats: RopeBWT (new), faster writing in SGA format.
-* `bwt_merge`: Multiple input files, faster RA/BWT merging, multithreaded verification, adjustable input/output formats and temp directory, better default parameters.
+* `bwt_merge`: Several improvements:
+  * Multiple input files in different formats.
+  * Faster RA/BWT merging.
+  * Multithreaded verification.
+  * Adjustable temp directory.
+  * Better default parameters.

### Version 0.2.1

@@ -145,6 +150,10 @@ There are also other algorithms for building the BWT for large read collections
* in reverse lexicographic order
* by position in the reference
* `bwt_merge`: Option to remove duplicate sequences.
+* `bwt_merge`: Option to write intermediate merge results to temporary files.
+  * the latest result to avoid restarting after crashes
+  * all intermediate results
+* `bwt_merge`: Option to use different merge parameters for each merge.
* `bwt_convert`: Build rank/select only when necessary.
* Documentation in the wiki.

20 changes: 13 additions & 7 deletions bwt_convert.cpp
@@ -56,27 +56,33 @@ main(int argc, char** argv)
{
case 'i':
input_tag = optarg;
+if(!formatExists(input_tag))
+{
+std::cerr << "bwt_convert: Invalid input format: " << input_tag << std::endl;
+std::exit(EXIT_FAILURE);
+}
break;
case 'o':
output_tag = optarg;
+if(!formatExists(output_tag))
+{
+std::cerr << "bwt_convert: Invalid output format: " << output_tag << std::endl;
+std::exit(EXIT_FAILURE);
+}
break;
case '?':
default:
std::exit(EXIT_FAILURE);
}
}

-if(optind < argc) { input_name = argv[optind]; }
-else
-{
-std::cerr << "bwt_convert: Input file not specified" << std::endl;
-}
-if(optind + 1 < argc) { output_name = argv[optind + 1]; }
-else
+if(optind + 1 >= argc)
{
std::cerr << "bwt_convert: Output file not specified" << std::endl;
std::exit(EXIT_FAILURE);
}
+input_name = argv[optind];
+output_name = argv[optind + 1];

std::cout << "Input: " << input_name << " (" << input_tag << ")" << std::endl;
std::cout << "Output: " << output_name << " (" << output_tag << ")" << std::endl;
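
Note on the hunk above: after this change a single check (`optind + 1 >= argc`) covers both a missing input name and a missing output name, and both cases report "Output file not specified". The following is a minimal sketch of the same check pulled into a testable helper. `FileNames` and `positionalNames` are hypothetical names that do not appear in the repository, and `first` stands in for `optind`.

```cpp
#include <string>

// Hypothetical result type: `ok` is false when fewer than two
// positional arguments remain after option parsing.
struct FileNames
{
  bool        ok;
  std::string input;
  std::string output;
};

// Sketch of the check in the diff: both names are read only after
// confirming that argv[first] and argv[first + 1] exist.
FileNames
positionalNames(int argc, char** argv, int first)
{
  FileNames names { false, "", "" };
  if(first + 1 >= argc) { return names; }
  names.ok = true;
  names.input = argv[first];
  names.output = argv[first + 1];
  return names;
}
```

A caller could print a message that distinguishes "no input file" (`first >= argc`) from "no output file" before exiting, if the more specific diagnostic is wanted.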
13 changes: 13 additions & 0 deletions bwt_merge.cpp
@@ -89,9 +89,22 @@ main(int argc, char** argv)
break;
case 'i':
tokenize(optarg, input_formats, ',');
+for(size_type i = 0; i < input_formats.size(); i++)
+{
+if(!formatExists(input_formats[i]))
+{
+std::cerr << "bwt_merge: Invalid input format: " << input_formats[i] << std::endl;
+std::exit(EXIT_FAILURE);
+}
+}
break;
case 'o':
output_format = optarg;
+if(!formatExists(output_format))
+{
+std::cerr << "bwt_merge: Invalid output format: " << output_format << std::endl;
+std::exit(EXIT_FAILURE);
+}
break;
case '?':
default:
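
Note on the hunk above: the `-i` option takes a comma-separated list, and every entry is validated before merging starts. The following is a rough standalone sketch of that split-then-validate pattern. `splitTags` and `firstInvalidIndex` are hypothetical helpers written for illustration. They are not the repository's `tokenize`, whose implementation is not shown in this diff, and the validity predicate is passed in rather than calling `formatExists` directly.

```cpp
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the repository's tokenize(): splits `list`
// at each `delim`. std::getline drops a trailing empty field.
std::vector<std::string>
splitTags(const std::string& list, char delim)
{
  std::vector<std::string> result;
  std::istringstream in(list);
  std::string token;
  while(std::getline(in, token, delim)) { result.push_back(token); }
  return result;
}

// Returns the index of the first tag rejected by `exists`,
// or tags.size() if every tag is accepted.
std::size_t
firstInvalidIndex(const std::vector<std::string>& tags,
                  const std::function<bool(const std::string&)>& exists)
{
  for(std::size_t i = 0; i < tags.size(); i++)
  {
    if(!exists(tags[i])) { return i; }
  }
  return tags.size();
}
```

Returning an index instead of exiting inside the loop keeps the check free of side effects, so the caller decides how to report the offending tag.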
12 changes: 12 additions & 0 deletions formats.cpp
@@ -446,6 +446,18 @@ SGAFormat::write(std::ofstream& out, const BlockArray& data, const NativeHeader&

//------------------------------------------------------------------------------

+bool
+formatExists(const std::string& format)
+{
+return (format == NativeFormat::tag)
+|| (format == PlainFormatD::tag)
+|| (format == PlainFormatS::tag)
+|| (format == RFMFormat::tag)
+|| (format == SDSLFormat::tag)
+|| (format == RopeFormat::tag)
+|| (format == SGAFormat::tag);
+}
+
void
printFormats(std::ostream& stream)
{
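
Note on the hunk above: `formatExists` compares the argument against the `tag` of each format struct in turn, so a new format must also be added to this chain. One possible alternative, shown only as a sketch, keeps the known tags in a single table and searches it. The tag strings below are placeholders chosen for illustration, since the actual `tag` values are defined elsewhere and do not appear in this diff. `KNOWN_FORMAT_TAGS` and `formatExistsSketch` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Placeholder tags (assumed, not taken from the repository). In the real
// code these would be NativeFormat::tag, PlainFormatD::tag, and so on.
static const std::array<std::string, 7> KNOWN_FORMAT_TAGS =
{{
  "native", "plain_default", "plain_sorted", "rfm", "sdsl", "ropebwt", "sga"
}};

// Table-driven membership test: true if `format` matches a known tag.
bool
formatExistsSketch(const std::string& format)
{
  return std::find(KNOWN_FORMAT_TAGS.begin(), KNOWN_FORMAT_TAGS.end(), format)
    != KNOWN_FORMAT_TAGS.end();
}
```

With seven formats a linear search is as fast as anything else. The benefit, if any, is that a function like `printFormats` could iterate over the same table instead of listing the formats a second time.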
2 changes: 2 additions & 0 deletions formats.h
@@ -157,6 +157,8 @@ struct SGAFormat

//------------------------------------------------------------------------------

+bool formatExists(const std::string& format);
+
void printFormats(std::ostream& stream);

template<class Format>
6 changes: 3 additions & 3 deletions paper/paper.tex
@@ -211,7 +211,7 @@

\Section{Implementation}

-We have implemented the proposed enhancements to the \BWT{} merging algorithm in a tool (\BWTmerge) intended for merging the \BWT{}s of large collections of short reads. \BWTmerge{} is written in C++, and the source code is available at GitHub.\footnote{\url{https://github.com/jltsiren/bwt-merge}} The implementation is built on top of the \emph{SDSL library} \cite{Gog2014b} and uses the features of C++11 extensively. As a result, it needs a fairly recent C++11 compiler to compile. We have built \BWTmerge{} on Linux and OS~X using g++.
+We have implemented the improved \BWT{} merging algorithm as a tool for merging the \BWT{}s of large collections of short reads. The tool, \BWTmerge{}, is written in C++, and the source code is available at GitHub.\footnote{\url{https://github.com/jltsiren/bwt-merge}} The implementation is built on top of the \emph{SDSL library} \cite{Gog2014b} and uses the features of C++11 extensively. As a result, it needs a fairly recent C++11 compiler to compile. We have successfully built \BWTmerge{} on Linux and OS~X using g++.

The target environment of \BWTmerge{} is a \emph{single node} of a \emph{computer cluster}. The system should have tens of CPU cores and hundreds of gigabytes of memory. The amount of local disk space might not be much larger than memory size, while there can be plenty of shared disk space available. The number of search threads is equal to the number of CPU cores, while the merge phase uses just one producer thread and one consumer thread. By adjusting the sizes of run buffers and thread buffers and the number of merge buffers, \BWTmerge{} should work reasonably well in different environments.

@@ -261,9 +261,9 @@
\item \RS{} is from the \emph{ReadServer project}, which uses all low-coverage and exome data from the phase 3. After error correction, trimming the reads to 73 or 100~bp, and merging the duplicates, there are 53.0~billion unique reads for a total of 4.88~Tbp. The reads are in 16 run-length encoded \BWT{}s built using the \emph{String Graph Assembler} \cite{Simpson2012}, distributed according to the last two bases.

\end{itemize}
-See Table~\ref{table:datasets} for further details on the datasets.
+See Table~\ref{table:datasets} for further details on the datasets. We used a development version of \BWTmerge{} that was essentially equivalent to v0.3 for the experiments. For the other tools, we used the versions that were available on GitHub in October~2015.

-\smallbreak\noindent\textbf{Benchmarking.} For benchmarking with different parameter values, we converted four \BWT{} files (AA, TT, AT, and TA) containing a total of 1.49~Tbp from the \RS{} dataset to the native format used by \BWTmerge. Then we merged the \BWT{}s (in the given order). We used 128~MB or 256~MB run buffers and 256~MB or 512~MB thread buffers. The number of merge buffers was 4 or 5 with 512~MB thread buffers and 5 or 6 with 256~MB thread buffers, so that the files on disk were always merged from either 8~GB or 16~GB of thread buffers.
+\smallbreak\noindent\textbf{Benchmarking.} For benchmarking with different parameter values, we converted four \BWT{} files (AA, TT, AT, and TA) containing a total of 1.49~Tbp from the \RS{} dataset to the \emph{native format} of \BWTmerge. This format includes the \BWT{} and the \rank/\select{} structures required by the FM-index. We then merged the \BWT{}s (in the given order). We used 128~MB or 256~MB run buffers and 256~MB or 512~MB thread buffers. The number of merge buffers was 4 or 5 with 512~MB thread buffers and 5 or 6 with 256~MB thread buffers, so that the files on disk were always merged from either 8~GB or 16~GB of thread buffers.

\begin{figure}[t!]
\begin{center}