ariahahn edited this page Jun 30, 2017 · 24 revisions

Welcome to the FragGeneScan-Plus wiki! Here we have a knowledge base detailing installation and usage of FragGeneScan-Plus.

FragGeneScan-Plus for scalable high-throughput short-read open reading frame prediction

Overview

A fundamental step in the analysis of environmental sequence information is the prediction of potential genes or open reading frames (ORFs) encoding the metabolic potential of individual cells and entire microbial communities. FragGeneScan is software designed to predict intact and incomplete ORFs on short sequencing reads. It combines codon usage bias, sequencing error models and start/stop codon patterns in a hidden Markov model to find the most likely path of hidden states for a given input sequence, and it provides a promising route for gene recovery in environmental datasets with incomplete assemblies. However, the current implementation of FragGeneScan does not scale efficiently with increasing input data size, which limits its application to contemporary environmental datasets that can exceed hundreds of Gb. Here, we present FragGeneScan-Plus, an improved implementation of the FragGeneScan gene prediction model that leverages algorithmic thread synchronization and efficient in-memory data management to utilize multiple CPU cores without blocking I/O operations. FragGeneScan-Plus can process data approximately 5 times faster than FragGeneScan using a single core and approximately 50 times faster using eight hyper-threaded cores when benchmarked against simulated and real-world environmental datasets of varying complexity.

Setup and Dependencies

FragGeneScan-Plus has the following dependencies:

  • make
  • gcc

To install FragGeneScanPlus, please follow the steps below:

  1. Download a release, or clone the GitHub repository and skip to step 3.

  2. Unzip the downloaded file "FragGeneScanPlus.tar.gz". This automatically creates the directory "FragGeneScanPlus".

  3. From the FragGeneScanPlus root directory, run make to compile and build the executable "FGS+".

Running FragGeneScan-Plus

NOTE: FGS+ should always be run from the FragGeneScanPlus root directory, which contains the "train" directory.

USAGE: ./FGS+ -s [seq_file_name] -m [max_mem_use] -o [output_file_name] -w [1 or 0] -t [train_file_name] -p [thread_num] -e [1 or 0] -d [1 or 0]

EXAMPLE USAGE: ./FGS+ -s example/NC_000913-454.fna -o output -w 0 -t 454_5 -p 16

MINIMAL USAGE: ./FGS+ -s [seq_file_name] -o [output_file_name] -w [1 or 0] -t [train_file_name]

INFO: By default, FragGeneScanPlus only outputs the amino acid file. To also obtain the gene coordinate information, set -e to 1; to also obtain the DNA file, set -d to 1.

Mandatory parameters

  • -s [seq_file_name]: sequence file name including the full path; specify stdin to read from standard input

  • -o [output_file_name]: output file name including the full path; specify stdout to write to standard output

  • -w [1 or 0]: 1 if the sequence file has complete genomic sequences, 0 if the sequence file has short sequence reads

  • -t [train_file_name]: name of the file that contains the model parameters; this file should be in the "train" directory. The following parameter files already exist in the "train" directory:

      • [complete] for complete genomic sequences or short sequence reads without sequencing error
      • [sanger_5] for Sanger sequencing reads with about 0.5% error rate
      • [sanger_10] for Sanger sequencing reads with about 1% error rate
      • [454_5] for 454 pyrosequencing reads with about 0.5% error rate
      • [454_10] for 454 pyrosequencing reads with about 1% error rate
      • [454_30] for 454 pyrosequencing reads with about 3% error rate
      • [illumina_1] for Illumina sequencing reads with about 0.1% error rate
      • [illumina_5] for Illumina sequencing reads with about 0.5% error rate
      • [illumina_10] for Illumina sequencing reads with about 1% error rate
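Choosing a value for -t can be scripted. The following is a minimal sketch, not part of FragGeneScan-Plus: the function name choose_train_file and the rule of picking the listed file whose error rate is closest to the expected one are assumptions made for illustration. Only the file names and error rates come from the list above.

```python
# Error rates (in percent) of the parameter files listed in the "train" directory.
TRAIN_FILES = {
    "sanger": {0.5: "sanger_5", 1.0: "sanger_10"},
    "454": {0.5: "454_5", 1.0: "454_10", 3.0: "454_30"},
    "illumina": {0.1: "illumina_1", 0.5: "illumina_5", 1.0: "illumina_10"},
}


def choose_train_file(platform=None, error_rate_percent=0.0):
    """Return a train file name for the -t flag.

    Hypothetical helper: picks the listed file whose error rate is nearest
    to error_rate_percent. With no platform or no sequencing error, it
    falls back to "complete".
    """
    if platform is None or error_rate_percent == 0:
        return "complete"
    rates = TRAIN_FILES[platform.lower()]
    nearest = min(rates, key=lambda rate: abs(rate - error_rate_percent))
    return rates[nearest]


if __name__ == "__main__":
    print(choose_train_file("454", 0.6))
```

An unknown platform name raises KeyError, which keeps typos from silently selecting the wrong model.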
    

Optional flags

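  • -p [thread_num]: number of threads used by FragGeneScan-Plus; default is 1 thread
  • -e [1 or 0]: 1 to also output the gene coordinate (metadata) file "[output_file_name].out"; not output by default
  • -d [1 or 0]: 1 to also output the DNA file "[output_file_name].ffn"; not output by default
  • -m [max_mem_use]: maximum amount of memory to be used by the application, in megabytes; default is 1024 (1 GB)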
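When FGS+ is launched from a script, the flags above can be assembled into an argument list. This is a minimal sketch: build_fgs_command and its parameter names are hypothetical, and only the flags themselves (-s, -o, -w, -t, -p, -m, -e, -d) are taken from this page.

```python
def build_fgs_command(seq_file, output, train, whole_genome=False,
                      threads=1, max_mem_mb=None,
                      write_coords=False, write_dna=False,
                      executable="./FGS+"):
    """Return the FGS+ argument list for the flags documented on this page."""
    cmd = [
        executable,
        "-s", seq_file,                      # input file, or "stdin"
        "-o", output,                        # output prefix, or "stdout"
        "-w", "1" if whole_genome else "0",  # 1 = complete genomes, 0 = short reads
        "-t", train,                         # parameter file in the "train" directory
    ]
    # Optional flags are added only when they differ from the documented defaults.
    if threads != 1:
        cmd += ["-p", str(threads)]
    if max_mem_mb is not None:
        cmd += ["-m", str(max_mem_mb)]
    if write_coords:
        cmd += ["-e", "1"]
    if write_dna:
        cmd += ["-d", "1"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_fgs_command(
        "example/NC_000913-454.fna", "output", "454_5", threads=16)))
```

The returned list can be passed to subprocess.run with cwd set to the FragGeneScanPlus root directory, since FGS+ expects to find the "train" directory there.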

To test FragGeneScanPlus on the example short-read file (454 reads, so -w is 0), run:

./FGS+ -s example/NC_000913-454.fna -o output -w 0 -t 454_5 -p 16

Upon completion, FragGeneScanPlus can generate three files: [output_file_name].faa, [output_file_name].out and [output_file_name].ffn.

  1. The "[output_file_name].faa" file lists amino acid sequences corresponding to the putative genes in "[output_file_name].out". The "[output_file_name].faa" file is always generated and needs no flag to trigger its output. An example of the "[output_file_name].faa" file looks like this,

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=108 end=338 strand=-
VVTSLPLVEKKSPHCQVRAFFCVSCTRQPAPLPVVMVMVVVMVVLMRFMDVVYSVIFICLCAMPILVKVFSDLSQ
>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=343 end=2799 strand=+
LKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKH
VLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLG
RNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRDEDE
LPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVG
DGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRVCGV
ANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKF
LYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEP
VLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAGNDVTA
AGVFADLLRTLSWKLGV
>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=2801 end=3733 strand=+
VKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSAC
SVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCI
AHGRHLAGFIHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVADWLGKNYLQNQEGFVHICRLD
TAGARVLEN
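The header of each predicted gene carries its coordinates, so the file can be read back into structured records. The sketch below assumes headers end with start=, end= and strand= fields as in the example above and that sequences may wrap over several lines; the function name read_predictions is hypothetical.

```python
import re

# Header layout assumed from the example: "<name> start=N end=N strand=+|-",
# with an optional leading ">".
HEADER_RE = re.compile(
    r"^>?(?P<name>.*?)\s+start=(?P<start>\d+)\s+end=(?P<end>\d+)"
    r"\s+strand=(?P<strand>[+-])\s*$"
)


def read_predictions(lines):
    """Yield one dict per predicted gene from .faa (or .ffn) style lines."""
    record = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            # A new header closes the previous record.
            if record is not None:
                yield record
            record = {
                "name": match.group("name"),
                "start": int(match.group("start")),
                "end": int(match.group("end")),
                "strand": match.group("strand"),
                "sequence": "",
            }
        elif record is not None:
            # Wrapped sequence lines are joined back together.
            record["sequence"] += line
    if record is not None:
        yield record


if __name__ == "__main__":
    with open("output.faa") as handle:  # file written by the test command
        for gene in read_predictions(handle):
            print(gene["name"], gene["start"], gene["end"], gene["strand"])
```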

  2. The "[output_file_name].out" file lists the coordinates of putative genes. This file consists of five columns (start position, end position, strand, frame, score). This file is not automatically generated. To obtain this file, set the -e flag to 1. An example of the "[output_file_name].out" file looks like this,

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome
108 440 - 3 1.378688
337 2799 + 1 1.303498
2801 3733 + 2 1.317386
3734 5020 + 2 1.293573
5234 5530 + 2 1.354725
5683 6459 - 1 1.290816
6529 7959 - 1 1.326412
8238 9191 + 3 1.286832
9306 9893 + 3 1.317067
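The five-column layout is simple to parse. The following sketch assumes each sequence header sits on its own line, followed by one whitespace-separated line per gene, as in the example above. The names Gene, parse_gene_line and parse_out are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int
    end: int
    strand: str
    frame: int
    score: float


def parse_gene_line(line):
    """Return a Gene for a five-column line, or None for anything else."""
    fields = line.split()
    if len(fields) != 5 or fields[2] not in ("+", "-"):
        return None
    try:
        return Gene(int(fields[0]), int(fields[1]), fields[2],
                    int(fields[3]), float(fields[4]))
    except ValueError:
        return None


def parse_out(lines):
    """Map each sequence header to the list of genes predicted on it."""
    genes = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        gene = parse_gene_line(line)
        if gene is None:
            # Any line that is not a gene row is treated as a header.
            current = line.lstrip(">")
            genes.setdefault(current, [])
        elif current is not None:
            genes[current].append(gene)
    return genes


if __name__ == "__main__":
    with open("output.out") as handle:  # requires running FGS+ with -e 1
        for header, found in parse_out(handle).items():
            print(header, len(found))
```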

  3. The "[output_file_name].ffn" file lists nucleotide sequences corresponding to the putative genes in "[output_file_name].out". This file is not automatically generated. To obtain this file, set the -d flag to 1. An example of the "[output_file_name].ffn" file looks like this,

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=108 end=338 strand=-
GTTGTTACCTCGTTACCTTTGGTCGAAAAAAAAAGCCCGCACTGTCAGGTGCGGGCTTTTTTCTGTGTTTCCTGTACGCGTCAGCCCGCACCGTTACCTG
TGGTAATGGTGATGGTGGTGGTAATGGTGGTGCTAATGCGTTTCATGGATGTTGTGTACTCTGTAATTTTTATCTGTCTGTGCGCTATGCCTATATTGGT
TAAAGTATTTAGTGACCTAAGTCAA
>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655, complete genome start=343 end=2799 strand=+
TTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCC
TCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTAT
TTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACAT
GTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCG
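Because the nucleotide and amino acid files describe the same putative genes, one can spot-check that a nucleotide sequence translates to the matching amino acid sequence. The sketch below is an illustration only. It assumes the standard genetic code and that the nucleotide sequences are already in coding orientation, which is what the examples on this page suggest (the first gene begins GTTGTTACC and its protein begins VVT). The function name translate and the use of "X" for unknown codons are assumptions, not FragGeneScan-Plus behaviour.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding-strand DNA string; trailing partial codons are dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks codons with unknown bases
        for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    # First codons of the first example gene on this page.
    print(translate("GTTGTTACCTCGTTACCTTTGGTCGAA"))
```

Stop codons appear as "*" in this sketch; whether and how FGS+ reports them is not covered here.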

Differences Between FragGeneScan-Plus and FragGeneScan

FragGeneScan-Plus supports a highly efficient threading model, allowing it to scale to modern datasets. It also supports standard input and output through pipes, allowing it to be used as part of UNIX pipelines. A variety of issues have been resolved, allowing smoother, more reliable usage. If an issue is encountered, please post it to the GitHub issues with the command used to generate the bug, a description of the bug and the data used to generate it.
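As an illustration of pipeline use, the sketch below connects a producer process to a consumer process through a pipe and saves the consumer's output. The helper run_pipeline is hypothetical, and so are the input file name reads.fna.gz and the output name reads.faa in the usage comment. The -s stdin and -o stdout values are the ones described under the mandatory parameters.

```python
import subprocess


def run_pipeline(producer_cmd, consumer_cmd, out_path):
    """Stream producer_cmd's stdout into consumer_cmd's stdin.

    The consumer's stdout is written to out_path. Returns the
    (producer, consumer) exit codes.
    """
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(
            consumer_cmd, stdin=producer.stdout, stdout=out
        )
        # Close our copy of the pipe so the producer sees a broken pipe
        # if the consumer exits early.
        producer.stdout.close()
        consumer_rc = consumer.wait()
        producer_rc = producer.wait()
    return producer_rc, consumer_rc


# Possible use, run from the FragGeneScanPlus root directory
# (reads.fna.gz is a hypothetical input file):
#
#   run_pipeline(
#       ["gzip", "-dc", "reads.fna.gz"],
#       ["./FGS+", "-s", "stdin", "-o", "stdout", "-w", "0", "-t", "illumina_5"],
#       "reads.faa",
#   )
```

Data flows between the two processes without an intermediate uncompressed file, which matters for inputs in the size range described in the overview.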

Citation

If you enjoy the software and are using it in an academic setting, please cite our paper, which can be found here: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7300341&tag=1