NCBI Prokaryotic Genome Annotation Pipeline

PGAP

NCBI Prokaryotic Genome Annotation Pipeline

The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Automatic Annotation Pipeline (PGAAP; see PubMed article), developed in 2005, has been replaced with an upgraded version that is capable of processing a larger data volume.

Instructions

To run PGAP you will need Linux, Docker, CWL (Common Workflow Language), and about 30 GB of supplemental data. The instructions we provide are for running under the CWL reference implementation, cwltool. Full instructions for installing, running, and interpreting the results may be found in our wiki.
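
Before following the wiki, it can help to confirm that the prerequisites listed above are in place. The snippet below is a minimal sketch, not part of PGAP itself: the function name `check_prerequisites`, the `data_dir` parameter, and the choice to test for `docker` and `cwltool` executables on the `PATH` are all assumptions made for illustration, and the 30 GB figure is treated as an approximate threshold.

```python
import platform
import shutil

# "about 30 GB" of supplemental data per the instructions; approximate.
REQUIRED_GB = 30


def check_prerequisites(data_dir=".", required_gb=REQUIRED_GB):
    """Return a dict mapping each prerequisite name to True or False.

    data_dir is a hypothetical location where the supplemental data
    would be stored; only its free disk space is inspected.
    """
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    return {
        # PGAP is documented as requiring Linux.
        "linux": platform.system() == "Linux",
        # Assumes the Docker client is installed as `docker` on PATH.
        "docker": shutil.which("docker") is not None,
        # Assumes the CWL reference runner is installed as `cwltool` on PATH.
        "cwltool": shutil.which("cwltool") is not None,
        "disk_space": free_gb >= required_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Finding the executables on the `PATH` only shows that they are installed; it does not verify that the Docker daemon is running or that the current user may access it.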

References

NCBI

NCBI prokaryotic genome annotation pipeline.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J.
Nucleic Acids Res. 2016 Aug 19;44(14):6614-24. Epub 2016 Jun 24.

RefSeq: an update on prokaryotic genome annotation and curation.
Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD.
Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.

GeneMarkS-2+

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.
Lomsadze A, Gemayel K, Tang S, Borodovsky M.
Genome Research. 2018; 28(7):1079-1089.

TIGRFAMs

TIGRFAMs: a protein family resource for the functional identification of proteins.
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O.
Nucleic Acids Res. 2001 Jan 1;29(1):41-3.

The TIGRFAMs database of protein families.
Haft DH, Selengut JD, White O.
Nucleic Acids Res. 2003 Jan 1;31(1):371-3.

TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O.
Nucleic Acids Res. 2007 Jan;35(Database issue):D260-4. Epub 2006 Dec 6.

TIGRFAMs and Genome Properties in 2013.
Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E.
Nucleic Acids Res. 2013 Jan;41(Database issue):D387-95. doi: 10.1093/nar/gks1234. Epub 2012 Nov 28.

LICENSING TERMS

NCBI PGAP CWL

The NCBI PGAP CWL and other code authored by NCBI is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.

Third-party tools

The Docker image contains third-party tools distributed under the licensing terms of the respective license holders.

GeneMarkS-2+

GeneMarkS-2+ is distributed as part of PGAP with limited rights of use and redistribution from the Georgia Tech Research Corporation. See the full text of the license.

TIGRFAMs

The original TIGRFAMs database was a research project of the J. Craig Venter Institute (JCVI). TIGRFAMs, short for The Institute for Genomic Research's database of protein families, is a collection of manually curated protein families focusing primarily on prokaryotic sequences. It consists of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) terminology, Enzyme Commission (EC) numbers, gene symbols, protein family names, descriptive text, cross-references to related models in TIGRFAMs and other databases, and pointers to the literature. The work has been described in the articles listed in the References section above, and users of the TIGRFAMs database must provide proper attribution by citing those four articles.

As of April 2018, rights were transferred to the National Center for Biotechnology Information (NCBI), National Library of Medicine, NIH, for the data to be made available for distribution under a Creative Commons Attribution-ShareAlike 4.0 license. Please see (https://creativecommons.org/licenses/by-sa/4.0/) for a brief summary of the license and (https://creativecommons.org/licenses/by-sa/4.0/legalcode) to see the full text.
