Before you download Arche (13 GB), make sure GeneMarkS-2 (GMS2) is working properly on your computer. As GMS2 requires a (free) licence, you must download it manually.
Download GeneMarkS-2 and key from http://exon.gatech.edu/GeneMark/license_download.cgi
tar xvfz gms2_linux_[version].tar.gz
Move the directory to the desired place, and make the binaries accessible from your PATH (e.g. add export PATH=$PATH:</path/to/gms2_linux_[version]> to your ~/.bashrc file)
Configure the key you've downloaded
gunzip gm_key.gz
cp gm_key ~/.gmhmmp2_key
or
cp gm_key ~/.gm_key
Test the software
gms2.pl --seq YOUR_GENOME
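Before running the test above, it can help to confirm that the two setup steps actually took effect. The following is a minimal sketch, not part of GeneMarkS-2 or Arche: `check_gms2_setup` is a hypothetical helper that only checks that `gms2.pl` is on your PATH and that a non-empty key file exists at one of the two locations mentioned above.

```shell
# Hypothetical helper (not shipped with GMS2 or Arche).
# Returns 0 if gms2.pl is on PATH and a key file exists, 1 otherwise.
check_gms2_setup() {
    rc=0
    if command -v gms2.pl >/dev/null 2>&1; then
        echo "gms2.pl found: $(command -v gms2.pl)"
    else
        echo "gms2.pl not found on PATH" >&2
        rc=1
    fi
    # Either key location described above is accepted.
    if [ -s "$HOME/.gmhmmp2_key" ] || [ -s "$HOME/.gm_key" ]; then
        echo "GeneMark key found in $HOME"
    else
        echo "no key at ~/.gmhmmp2_key or ~/.gm_key" >&2
        rc=1
    fi
    return "$rc"
}
```

A passing check does not prove the key is valid; only an actual `gms2.pl` run does that.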
To install the other dependencies, you will need the Anaconda distribution. Download and install it from https://www.anaconda.com/download/success
conda create -n arche_annotator diamond=2.0.14 bedtools=2.27.0 p7zip=16.02 barrnap=0.9 hmmer=3.3.2 prodigal=2.6.3 blast=2.12.0 fasta3=36.3.8i ucsc-fasomerecords=455 trnascan-se=2.0.9 gdown -c bioconda -c conda-forge
This command will create a conda environment for future Arche runs. It installs specific package versions from the bioconda and conda-forge channels.
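Once the environment has been created and activated with `conda activate arche_annotator`, a quick way to confirm that the dependencies resolved is to look for their executables. This is a rough sketch: `missing_tools` is a hypothetical helper, and the executable names in `ARCHE_TOOLS` are assumptions about what each conda package installs (for example `hmmscan` for hmmer, `ssearch36` for fasta3, `7z` for p7zip).

```shell
# Hypothetical helper: print every given command name that is NOT on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed executable names for the packages in the conda command above.
ARCHE_TOOLS="diamond bedtools 7z barrnap hmmscan prodigal blastp ssearch36 faSomeRecords tRNAscan-SE gdown"
```

Calling `missing_tools $ARCHE_TOOLS` inside the activated environment should print nothing if every executable was found; any name it prints is worth checking by hand, since the assumed name may simply differ.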
Arche, with its pre-formatted databases and mapping files (13 GB), can be downloaded from the command line using gdown:
conda activate arche_annotator
gdown --fuzzy "https://drive.google.com/file/d/1x9caXGPpYXCHUoodOdnuJI0tCDe9qtGG/view?usp=sharing"
Once the download is finished:
tar -xvf arche_[version].tar (move the output directory to the desired place)
cd arche_[version]/bin/
chmod +x arche.sh
./arche.sh --install
You should make the bin directory accessible to your PATH (e.g. add export PATH=$PATH:</path/to/arche_[version]/bin> to your ~/.bashrc file)
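Adding the export line by hand works, but appending it more than once clutters the file. The sketch below shows one way to add it only if it is missing. `add_to_path_rc` is a hypothetical helper, not part of Arche; it assumes a Bourne-style shell startup file and defaults to `~/.bashrc`.

```shell
# Hypothetical helper: append a PATH export for a directory to a startup
# file, unless exactly the same line is already present.
add_to_path_rc() {
    dir=$1
    rcfile=${2:-$HOME/.bashrc}
    line="export PATH=\"\$PATH:$dir\""
    if [ -f "$rcfile" ] && grep -qxF -- "$line" "$rcfile"; then
        echo "already present in $rcfile"
    else
        printf '%s\n' "$line" >> "$rcfile"
        echo "added to $rcfile"
    fi
}
```

For example, `add_to_path_rc /path/to/arche_[version]/bin` (with the real path substituted) followed by `source ~/.bashrc` makes `arche.sh` callable from any directory in the current shell.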
If the installation or a run fails:
- Check you are working within the conda environment you've created ("conda activate arche_annotator")
- Check you have properly installed GeneMarkS-2
- If you have already run the command ./arche.sh --install, open the arche.sh script using a text editor and in the section "Main directory" (first lines) replace the string after DIR= with the full path of the working directory, e.g. /home/YOUR_USER/arche_1.0.1
- Delete arche's directory, uncompress from tar file, and install again
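The first and third checks in the list above can be scripted. This is a minimal sketch under two assumptions: that `conda activate` has set the `CONDA_DEFAULT_ENV` variable (which conda normally does), and that the `DIR=` assignment in `arche.sh` starts at the beginning of a line. Both function names are hypothetical.

```shell
# Returns 0 when the active conda environment is arche_annotator.
in_arche_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "arche_annotator" ]
}

# Print the value assigned on the first line that starts with DIR=
# in the given script (assumed layout of the "Main directory" section).
arche_dir_setting() {
    sed -n 's/^DIR=//p' "$1" | head -n 1
}
```

If `arche_dir_setting arche.sh` prints a path that does not match where the Arche directory actually lives, that is the string to correct by hand as described above.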
arche.sh -n ecoli -t 20 -r 40 e_coli.fna
arche.sh -n halorubrum -a ssearch -k arch halorubrum_sp_DM2.fa
arche.sh -n seawater_metagenome -k meta seawater_metagenome.fna
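When several genomes need the same settings, the single-file calls above can be wrapped in a loop. The sketch below is one possible way to do it, using only the `-n` and `-t` options. `annotate_dir` and the `ARCHE` variable are hypothetical names introduced here; setting `ARCHE=echo` gives a dry run that prints the arguments instead of running anything.

```shell
# Hypothetical batch wrapper: annotate every .fna file in a directory,
# naming each run after the input file without its extension.
annotate_dir() {
    dir=$1
    threads=${2:-1}
    for genome in "$dir"/*.fna; do
        [ -e "$genome" ] || continue   # glob matched nothing
        name=$(basename "$genome" .fna)
        "${ARCHE:-arche.sh}" -n "$name" -t "$threads" "$genome"
    done
}
```

For instance, `annotate_dir ./genomes 20` would call `arche.sh` once per file with 20 threads. Add `-k`, `-a` or other options inside the function if every genome in the batch needs them.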
Here you can download a sample which includes the annotation of Escherichia coli K12 with several tools including Arche:
File(s) | Description |
---|---|
rRNA.tsv | GFF v3 file containing rRNA annotations. |
rRNA.fna | FASTA file of all rRNA features. |
tRNA.tsv | Table with tRNA details (coordinates, isotype, anticodon, scores, etc). |
[...]_struc_annot.fna | FASTA file of all genomic features (nucleotide). |
[...]_struc_annot.faa | FASTA file of translated coding genes (amino acid). |
heuristic[...]_out | Output matches of the search instance(s) performed with BLASTp, DIAMOND or SSEARCH36. |
heuristic[...]_non_match.faa | FASTA file with the remaining non-matched sequences after the search instance(s) performed with BLASTp, DIAMOND or SSEARCH36. |
hmmscan_[...]_out | HMMER3 output table of the search instance(s) performed against a specific HMMDB. |
[HMMDB]_non_match.faa | FASTA file with the remaining non-matched sequences after the search instance performed against a specific HMMDB. |
[...]_omic_table.tbl | Feature table with fields separated by vertical bars. |
[...]_omic_table.tsv | Feature table with tab-separated fields. |
arche_report | File which includes the parameters of the run and results. |
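Because several of these outputs are plain FASTA files, a quick sanity check is to count their records. The sketch below counts header lines and estimates what share of proteins was left unmatched. Both helpers are hypothetical, and the percentage assumes that the non-match file is a subset of the translated `.faa` file it is compared against.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_fasta() {
    grep -c '^>' "$1" || true   # grep exits 1 on zero matches; still prints 0
}

# Integer percentage of records in file 2 relative to file 1.
unmatched_percent() {
    total=$(count_fasta "$1")
    unmatched=$(count_fasta "$2")
    [ "$total" -gt 0 ] || { echo "no sequences in $1" >&2; return 1; }
    echo $(( 100 * unmatched / total ))
}
```

Pass the actual file names from your output directory as arguments; the result is truncated to a whole number.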
-h, --help This help.
-i, --install Set up the executable location, and install databases.
-n, --name-files Name of the files to be created in the output directory, including the directory itself (default 'arche').
-o, --output Provide the full path to the directory where the output directory will be created. E.g. /home/user/ (default current).
-k, --kingdom Source of the contigs. Use 'arch' for archaeal genomes or 'meta' for metagenomes (default is for bacterial genomes).
-m, --mode Gives priority to Orthology (KO, eggNOG) or Enzyme Commission designed databases during the annotation. Use 'kegg' for KO-->eggNOG-->E.C., 'eggnog' for eggNOG-->KO-->E.C., or 'ec' for E.C.-->KO-->eggNOG (default will use a shorter swiss-prot KO·eggNOG·E.C. designed database with no priority).
-a, --alignment Select the algorithm to use during the protein alignment step: 'diamond' (accelerated blastp) or 'ssearch' (Smith-Waterman) (default 'blastp').
-t, --threads Number of threads to use (default '1').
-r, --memory Amount of RAM to use in GB (default '2').
-e, --evalue Similarity e-value cut-off (default '1e-08').
-q, --query-cov Minimum coverage on query protein (default '70').
-b, --bypass Use 'yes' to bypass the RNA gene prediction.
-v, --verbose Use 'yes' to turn on the verbose mode.
- Daniel Alonso
- email: gundizalvus16@hotmail.com