SACO: a compression tool for the sequence alignments found in MAF files.
This tool was designed to compress the DNA bases and gap symbols found in MAF files. The method is based on a mixture of finite-context models. Unlike a recent approach (Hanus 2010), it handles the DNA bases and the gap symbols together, which lets it better exploit the existing correlations. For comparison with previous methods, the algorithm was tested on the multiz28way dataset. On average, it attained 0.94 bits per symbol, approximately 7% better than the previous best, at a similar computational complexity. We also tested the model on the most recent dataset, multiz46way. On this dataset, which contains alignments of 46 different species, the compression model achieved an average of 0.72 bits per MSA block symbol.
To compile the source code, you need a GCC compiler on a Unix platform (Linux or OS X). If you are using Windows, it is easier to use the pre-compiled binaries in the win32 and win64 folders.
For Linux users, install the build-essential package, which contains GCC and the other utilities needed to compile the source code. To install the build-essential package, type:
sudo apt-get install build-essential
After that you only need to type:
make -f Makefile.linux
to create the binaries SACOe (encoder) and SACOd (decoder).
For OS X users, the procedure depends on which Xcode version is installed. Recent versions no longer seem to install the "Command Line Tools" by default, and you need them to get the "make" utility. To install them, open Xcode and go to Preferences -> Downloads -> Components -> Command Line Tools. This should also install a GCC compiler. If you want a more recent compiler, you can install one with Homebrew by typing the following command in a Terminal:
brew install gcc48
After that, make sure that the "CC" variable in the "Makefile.osx" file points to the GCC installed in the previous step. The most recent versions of Xcode ship an LLVM-based compiler in place of the original GCC. This tool was not tested with LLVM, so it will probably not work if you compile the code with it. To generate the binaries, just type:
make -f Makefile.osx
to create the binaries SACOe (encoder) and SACOd (decoder).
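Before building on OS X, it can help to check which compiler a given command really invokes. The following is a minimal sketch, not part of SACO: the function name compiler_kind and the variable CC_CANDIDATE are hypothetical, and the compiler name gcc-4.8 is an assumption about how Homebrew names the binary installed above.

```shell
#!/bin/sh
# Classify the text printed by "<compiler> --version".
# "llvm" covers both clang and the older llvm-gcc shipped with Xcode.
compiler_kind() {
    case "$1" in
        *clang*|*LLVM*) echo "llvm" ;;
        *GCC*|*gcc*|*"Free Software Foundation"*) echo "gcc" ;;
        *) echo "unknown" ;;
    esac
}

# Assumption: the Homebrew compiler is available as gcc-4.8.
CC_CANDIDATE=${CC_CANDIDATE:-gcc-4.8}
if command -v "$CC_CANDIDATE" >/dev/null 2>&1; then
    compiler_kind "$("$CC_CANDIDATE" --version 2>&1)"
fi
```

If the check prints "gcc", that compiler name is the value to use for "CC". As an alternative to editing the file, make normally lets a variable given on the command line take precedence over the one set in the Makefile, for example: make -f Makefile.osx CC=gcc-4.8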
The source code was NOT tested in a Windows environment. Nevertheless, you can build Windows executables from a Linux environment using the MinGW-w64 cross-compiler. After installing MinGW-w64, just type:
make -f Makefile.win32
to get the SACOe32.exe (encoder) and SACOd32.exe (decoder) executables for the 32-bit architecture. For the 64-bit architecture, just type:
make -f Makefile.win64
to get the SACOe64.exe (encoder) and SACOd64.exe (decoder) executables. The encoder seems to work fine; however, there is a bug in the decoder that will be fixed soon.
The SACOe, SACOe32.exe, and SACOe64.exe programs have several parameters that can be set by the user. The most relevant ones are described below.
Usage: SACOe [options] ... [MAF File]
The most relevant options are:
-v | Activates verbose mode. |
-h | Prints some help information. |
-o [encodedFile] | If present, it writes the encoded data into file "encodedFile". |
-e | Estimation only. Does not create the binary compressed file. |
-alm | Activates the ancestral line mode. |
-scm | Activates the static column model. |
-cm1 [n/d t=threshold] | Columnwise Model 1. |
-cmn [n/d t=threshold] | Columnwise Model N. |
-u 0 [leftSize-rightSize n/d t=threshold] | Ancestral context model with "leftSize" symbols on the left and "rightSize" symbols on the right. |
-u template [n/d t=threshold] | 2D image context template. Templates available 1-14 and 20-24. |
-g [gamma] | Gamma value used in the model mixture. |
The SACOd, SACOd32.exe, and SACOd64.exe programs have the following interface:
Usage: SACOd [options] ... [Encoded File]
The following examples show how to use this tool in a Linux environment.
We can encode a MAF file using, for example, two models (order-9 and order-11) and write the encoded data to "file.enc" by typing:
$ SACOe -u 9 -u 11 -o file.enc chrM.maf
To decode the encoded file, just type:
$ SACOd -o file.dec file.enc
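To relate an encoded file to the bits-per-symbol figures quoted above, the compressed size can be divided by the number of alignment symbols in the MAF file. The following is a minimal sketch, not part of SACO: the function names count_symbols and bits_per_symbol are hypothetical, and it assumes that the symbols to count are the characters in the seventh field of the MAF lines starting with "s" (bases and gaps). The exact counting used in the publications may differ.

```shell
#!/bin/sh
# Count alignment symbols (bases and gaps) in a MAF file.
# Assumption: sequence rows start with "s" and hold the aligned text in field 7.
count_symbols() {
    awk '$1 == "s" { n += length($7) } END { print n + 0 }' "$1"
}

# Bits per symbol = 8 * encoded bytes / number of alignment symbols.
# Usage: bits_per_symbol <MAF file> <encoded file>
bits_per_symbol() {
    maf=$1
    enc=$2
    symbols=$(count_symbols "$maf")
    bytes=$(wc -c < "$enc")
    awk -v b="$bytes" -v n="$symbols" \
        'BEGIN { if (n == 0) { print "no symbols found"; exit 1 }
                 printf "%.3f\n", 8 * b / n }'
}
```

With the files from the example above, this would be called as: bits_per_symbol chrM.maf file.enc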
Some data sets that can be used to evaluate this tool.
If you use this software, please cite the following publications:
- Luís M. O. Matos, Diogo Pratas, and Armando J. Pinho, "A Compression Model for DNA Multiple Sequence Alignment Blocks", in IEEE Transactions on Information Theory, volume 59, number 5, pages 3189-3198, May 2013.
- Luís M. O. Matos, Diogo Pratas, and Armando J. Pinho, "Compression of whole genome alignments using a mixture of finite-context models", in Proceedings of the International Conference on Image Analysis and Recognition, ICIAR 2012 (Editors: A. Campilho and M. Kamel), volume 7324 of Lecture Notes in Computer Science (LNCS), pages 359-366, Springer Berlin Heidelberg, Aveiro, Portugal, June 2012.
The Windows decoders (SACOd32.exe and SACOd64.exe) have a bug that will be fixed soon. For other issues, please use the issues link on GitHub.
Copyright (c) 2014 Luís M. O. Matos. See LICENSE.txt for further details.