
Spark Sequence Reducer

Spark Sequence Reducer is an application designed to reduce the size of the whole genome reference sequence databases typically used by sequence analysis tools. It is built on Apache Spark and a MapReduce-like model to achieve high scalability, which makes it possible to reduce large sequence databases within an acceptable time frame, something that would otherwise be impractical.

Spark Sequence Reducer uses the taxonomic classification of the sequences in the reference database to find and remove highly similar regions among sequences that share a taxon at a chosen rank. The algorithm stores shared data only once while keeping the unique regions of every sequence, avoiding data loss.
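Conceptually, the reduction is a group-by-taxon step followed by a per-group merge. The PySpark sketch below illustrates that shape only; the two helper functions are simplified stand-ins, not functions exported by this repository.

# Conceptual PySpark sketch of the reduction; helpers are illustrative stand-ins.
from pyspark import SparkContext

def taxid_for_rank(accession, rank):
    # Stand-in: the real tool resolves accessions through
    # nucl_gb.accession2taxid and the names.dmp/nodes.dmp taxonomy.
    return "taxid-for-" + accession  # hypothetical mapping

def merge_group(sequences):
    # Stand-in: the real tool aligns pairs with Stretcher, stores
    # shared regions once, and keeps each sequence's unique regions.
    return sequences

def reduce_by_rank(sc, records, rank="species"):
    # records: (accession, sequence) pairs parsed from the input FASTA
    return (sc.parallelize(records)
              .map(lambda rec: (taxid_for_rank(rec[0], rank), rec))  # key by taxon
              .groupByKey()                                          # one bucket per taxon
              .mapValues(lambda seqs: merge_group(list(seqs))))      # merge each bucket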

Requirements

  • A Python 3 or Python 2 interpreter (Python 3 preferred) on all nodes of your cluster.
  • GCC for C code compilation.
  • A configured Apache Spark installation.

Compiling Stretcher

Spark Sequence Reducer uses Stretcher, a global alignment algorithm, to find highly similar regions in closely related sequences. Stretcher needs only linear space to find the optimal pairwise global alignment. This low memory footprint makes it possible to align a larger number of sequences in parallel as part of the reduction process.

A precompiled dynamic library (stretcher.so) is provided, but you can compile it manually if needed using make:

make compile

Stretcher is part of EMBOSS and has been modified for standalone use. Spark Sequence Reducer wraps Stretcher using ctypes to make the program callable from Python code.
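Loading the shared library with ctypes follows the standard pattern shown below; note that the entry-point name and signature used here are assumptions for illustration, not the documented interface of stretcher.so.

import ctypes

# Load the compiled library; the function name and signature below are
# illustrative assumptions, not stretcher.so's actual interface.
lib = ctypes.CDLL("./stretcher.so")
# Suppose the library exposed: int align(const char *a, const char *b)
lib.align.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
lib.align.restype = ctypes.c_int

score = lib.align(b"ACGTACGT", b"ACGAACGT")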

Configuration instructions

  • Compile stretcher, the global alignment algorithm (See Compiling Stretcher).
  • Download nucl_gb.accession2taxid.gz and taxdump.tar.gz to the data/ directory of Spark Sequence Reducer (either manually or by running ncbitax-download.sh).
  • Extract names.dmp and nodes.dmp from data/taxdump.tar.gz to data/ (you can skip this step if you used ncbitax-download.sh).
  • Run make configure to create and configure secondary files using the taxonomy files.
  • If running on multiple machines, replicate the Spark Sequence Reducer directory to all the nodes in your cluster (see the sketch after this list).
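Put together, a typical setup run might look like the following; the worker host names in the loop are placeholders for your own cluster:

make compile              # build stretcher.so (see Compiling Stretcher)
./ncbitax-download.sh     # download and extract the taxonomy files
make configure            # build the secondary files from the taxonomy files
# Multi-node clusters only; worker1 and worker2 are placeholder host names.
for node in worker1 worker2; do
    rsync -a ./ "$node":~/Spark-Sequence-Reducer/
done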

Running Spark Sequence Reducer

To run the program using spark-submit:

$SPARK_HOME/bin/spark-submit [spark-options] sparkseqreducer.py [-h]
                          [-r [{species,genus,family,order,class,phylum,superkingdom}]]
                          infile outfile

Arguments:

positional arguments:
  infile                Path to the input file containing the reference sequences to
                        reduce.
  outfile               Path to the output file to save the resulting reduced
                        sequences.

optional arguments:
  -h, --help            show this help message and exit
  -r [{species,genus,family,order,class,phylum,superkingdom}], --rank [{species,genus,family,order,class,phylum,superkingdom}]
                        The taxonomic rank to use for the reduction. Must be
                        one of the following: species, genus, family, order,
                        class, phylum or superkingdom

For example, to reduce example.fasta with species selected as the taxonomic rank for reduction, writing the reduced sequences to $HOME/output:

$SPARK_HOME/bin/spark-submit sparkseqreducer.py --rank species example.fasta $HOME/output
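When submitting to a cluster, the usual spark-submit options go before the script; the master URL and resource sizes below are placeholders to adapt to your environment:

$SPARK_HOME/bin/spark-submit \
    --master spark://master-host:7077 \
    --executor-memory 4G \
    --total-executor-cores 16 \
    sparkseqreducer.py --rank species example.fasta $HOME/output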
