A tool to find relative absent words in genomic data

EAGLE

Search for relative absent words (RAWs) in genomic sequences using a reference sequence. Currently, EAGLE runs in a command-line environment. For each k-mer size in a given range, it writes the absent words and their positions to a file. EAGLE can run in multi-threaded mode to reduce computation time.

INSTALLATION

CMake is needed for installation (http://www.cmake.org/). You can download it directly from http://www.cmake.org/cmake/resources/software.html or use an appropriate package manager. The following instructions show how to download and compile EAGLE and then compute the RAWs:

STEP 1

Download, install and resolve conflicts.

Linux

#sudo apt-get install cmake
git clone https://github.com/pratas/eagle.git
cd eagle/src/
cmake .
make

Alternatively, on Linux only, you can build without CMake using the provided Makefile:

git clone https://github.com/pratas/eagle.git
cd eagle/src/
mv Makefile.linux Makefile
make

OS X

Install Homebrew (brew), only if you do not already have it:

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

Then type:

brew install cmake
brew install wget
brew install gcc48
wget https://github.com/pratas/eagle/archive/master.zip
unzip master.zip
cd eagle-master/src/
cmake .
make

With some versions you might need to link cc and gcc to the newly installed compiler (after the brew install gcc48 command), namely:

sudo mv /usr/bin/gcc /usr/bin/gcc-old   # gcc backup
sudo mv /usr/bin/cc /usr/bin/cc-old     # cc backup
sudo ln -s /usr/bin/gcc-4.8 /usr/bin/gcc
sudo ln -s /usr/bin/gcc-4.8 /usr/bin/cc

In some versions, gcc48 is installed under /usr/local/bin; in that case, replace the last two commands with the following two:

sudo ln -s /usr/local/bin/gcc-4.8 /usr/bin/gcc
sudo ln -s /usr/local/bin/gcc-4.8 /usr/bin/cc

Windows

On Windows, use Cygwin (https://www.cygwin.com/) and make sure that the installation includes cmake, make, zcat, unzip, wget, tr and grep (and any dependencies). If you install the complete set of Cygwin packages, all of these will be included. After that, all steps are the same as on Linux.

EXECUTION

As an example, the objective is to find relative absent words (RAWs) that appear in an E. coli genome and not in the assembled (GRC) human chromosome 18, for k-mer sizes between 9 and 13 (including inverted words).
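
To make the objective concrete, the following is a minimal Python sketch of what a RAW is, not EAGLE's implementation. It assumes that a RAW is any k-mer that occurs in the target and is absent from the reference, that inverted words are added to the reference side, and that positions are 0-based. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N maps to N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def kmers(seq, k, inverted=False):
    """Collect the set of k-mers of seq, optionally adding reverse complements."""
    words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if inverted:
        # Assumption: inverted words extend the set of words seen in the reference.
        words |= {reverse_complement(word) for word in words}
    return words


def relative_absent_words(reference, target, k, inverted=False):
    """Map each k-mer of target that is absent from reference to its positions."""
    known = kmers(reference, k, inverted)
    raws = {}
    for i in range(len(target) - k + 1):
        word = target[i:i + k]
        if word not in known:
            raws.setdefault(word, []).append(i)  # 0-based position (assumption)
    return raws


# Toy sequences, not real genomic data.
print(relative_absent_words("AAAC", "GTTT", 4))
print(relative_absent_words("AAAC", "GTTT", 4, inverted=True))
```

In the toy run, GTTT is absent from the reference AAAC, but it is the reverse complement of AAAC, so it stops being reported once inverted words are considered.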

Get data

EAGLE accepts FASTA (http://en.wikipedia.org/wiki/FASTA_format) and SEQ (ACGTN characters only) formats. Data can be downloaded through a graphical interface or with wget, and then decompressed. Below we use wget:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38_chr18.fa.gz ;
gunzip hs_ref_GRCh38_chr18.fa.gz ;
mv hs_ref_GRCh38_chr18.fa C18.fa ;
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__DH10B_uid58979/NC_010473.fna ;
mv NC_010473.fna ECOLI.fna ;
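
For readers who want to inspect the downloaded files before running EAGLE, here is a small Python sketch that reads either accepted format into one string. How EAGLE itself treats header lines, lowercase letters, multiple records or other symbols is not stated here. The sketch assumes headers are skipped, letters are uppercased, records are concatenated, and anything outside ACGTN is dropped. The name read_sequence is hypothetical.

```python
import os
import tempfile


def read_sequence(path):
    """Read a FASTA or plain SEQ file and return its ACGTN symbols as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith(">"):
                continue  # skip blank lines and FASTA header lines
            # Assumption: uppercase everything and keep only ACGTN.
            parts.append("".join(c for c in line.upper() if c in "ACGTN"))
    return "".join(parts)


# Demo on a tiny made-up FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fa")
    with open(demo, "w") as out:
        out.write(">record 1\nACGT\nnnAC\n>record 2\nGGTT\n")
    sequence = read_sequence(demo)

print(sequence)
```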

Run EAGLE

Run EAGLE using:

./EAGLE -v -t -min 9 -max 13 -i -r C18.fa ECOLI.fna

It will create one file per k-mer size, named with the prefix "ECOLI.fna" followed by a suffix such as "-k9.eg", where the number 9 stands for k = 9; "-k10.eg" stands for k = 10, and so on. Each file contains the RAWs for that k along with their positions (the content is ordered by position). An empty file means that there are no RAWs for that k; the console output should still report something like "RAWs FOUND : 0.0000 % ( 0 in 4753180 )" (this example is for k = 8).
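
The naming scheme described above can be sketched in Python to check, after a run, which k values produced RAWs. The file names follow the prefix and suffix given in the text. The function names and the directory argument are hypothetical, and the sketch assumes the files are written to the directory you pass in. It does not parse the file contents, because the format is not specified here.

```python
import os


def expected_outputs(target, kmin, kmax):
    """Return the output file name described above for each k in [kmin, kmax]."""
    return {k: f"{target}-k{k}.eg" for k in range(kmin, kmax + 1)}


def report(target, kmin, kmax, directory="."):
    """Classify each expected output file as missing, empty or containing RAWs."""
    status = {}
    for k, name in expected_outputs(target, kmin, kmax).items():
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            status[k] = "missing"
        elif os.path.getsize(path) == 0:
            status[k] = "empty"  # an empty file means no RAWs for this k
        else:
            status[k] = "has RAWs"
    return status


print(expected_outputs("ECOLI.fna", 9, 13))
```

Calling report("ECOLI.fna", 9, 13) in the directory holding the results would then give one status per k.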

PARAMETERS

To see the possible options type

./EAGLE

or

./EAGLE -h

These will print the following options:

Usage: EAGLE <OPTIONS> ... -r [FILE] [FILE]:<...>

  -v                       verbose mode,
  -a                       about EAGLE,
  -t                       use multi-threading,
  -i                       use inversions,
  -min <k-mer>             k-mer minimum size,
  -max <k-mer>             k-mer maximum size,
  -r [rFile]               reference file (db),

  [tFile1]:<tFile2>:<...>  target file(s).

EAGLE is a fast method/tool to compute relative MAWs.
The input files should be FASTA (.fa) or SEQ [ACGTN].

Options meaning

| Parameters | Meaning |
|------------|---------|
| -h | Prints the parameters menu (help menu). |
| -v | Prints progress information, such as the number of RAWs found. |
| -a | Prints the EAGLE version number, license type and authors. |
| -t | Uses multi-threading. The number of threads equals the maximum k-mer size minus the minimum k-mer size. Running time will be much lower, although more memory is used (the memory of each model is cumulative). |
| -i | Inverted words (reverse complemented) will also be considered. |
| -min <k-mer> | Size of the minimum k-mer (word size). Allowed interval: [1;28]. Sizes above 16 are handled with a hash table, whose memory use is approximately linear in the size of the sequence. |
| -max <k-mer> | Size of the maximum k-mer (word size). Allowed interval: [1;28]. Sizes above 16 are handled with a hash table, whose memory use is approximately linear in the size of the sequence. |
| -r [refFile] | The reference filename. Accepted sequence alphabet: [A,C,G,T,N]. |
| [tarFile] | The target filename(s). To use multiple files, separate them with ":". Example: Virus1:Virus2:virus3. Accepted sequence alphabet: [A,C,G,T,N]. |
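
When EAGLE is driven from a script, the options above can be assembled programmatically. The Python sketch below is one possible way to do so. It uses only the flags listed in the table, joins multiple targets with ":", and checks the documented [1;28] interval. The function name, its arguments and the min <= max check are assumptions made for illustration and are not part of EAGLE.

```python
def build_eagle_command(reference, targets, kmin, kmax,
                        verbose=False, threads=False, inverted=False,
                        executable="./EAGLE"):
    """Build an argument list for EAGLE from the documented options."""
    # Documented interval is [1;28]; requiring kmin <= kmax is an assumption.
    if not (1 <= kmin <= kmax <= 28):
        raise ValueError("k-mer sizes must satisfy 1 <= min <= max <= 28")
    if not targets:
        raise ValueError("at least one target file is required")
    command = [executable]
    if verbose:
        command.append("-v")
    if threads:
        command.append("-t")
    command += ["-min", str(kmin), "-max", str(kmax)]
    if inverted:
        command.append("-i")
    # Multiple target files are passed as one argument separated by ":".
    command += ["-r", reference, ":".join(targets)]
    return command


command = build_eagle_command("C18.fa", ["ECOLI.fna"], 9, 13,
                              verbose=True, threads=True, inverted=True)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the directory that contains the EAGLE binary.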

CITATION

If you use this software/method, please cite:

Raquel M. Silva, Diogo Pratas, Luísa Castro, Armando J. Pinho & Paulo J. S. G. Ferreira. Bioinformatics (2015): btv189. DOI: 10.1093/bioinformatics/btv189 (http://doi.org/10.1093/bioinformatics/btv189).

ISSUES

To report an issue, let us know through the issues page of the repository.

LICENSE

GPL v3.

For more information:

http://www.gnu.org/licenses/gpl-3.0.html