Output Files
Francoise Thibaud-Nissen edited this page Dec 19, 2022 · 16 revisions
Here are the expected output files for a successful annotation.
- ani-tax-report.txt: Results of the taxonomy check in text format. See the Taxonomy Check documentation for a description. (Only produced if using the flag `--taxcheck` or `--taxcheck-only`.)
- ani-tax-report.xml: Results of the taxonomy check in XML format. See the Taxonomy Check documentation for a description. (Only produced if using the flag `--taxcheck` or `--taxcheck-only`.)
- annot.faa: Protein products annotated on the genome in FASTA format. The FASTA definition line is formatted as a type general identifier (gnl|extdb|<locus_tag>) plus the product name. You can provide the locus tag prefix of your choice in the input metadata YAML file (see the Note about locus tags, and how to prepare your Input files).
- annot.fna: Genomic sequence(s) in FASTA format, as provided on input.
- annot.gbk: Annotated genomic sequence(s) in GenBank flat file format. Genes use the <locus_tag>, and protein_ids use the format extdb:<locus_tag>.
- annot.gff: Annotation of the genomic sequence(s) in Generic Feature Format Version 3 (GFF3). Sequence identifiers (column 1) correspond to the identifiers in the input FASTA file. Identifiers for genes use the format gene-<locus_tag>, and identifiers for CDSs use the format cds-<locus_tag>, matching the locus tags in the annot.gbk file. Protein_ids use the format extdb:<locus_tag>, similarly to the annot.faa file. Additional information about NCBI's GFF files is available at README_GFF3.txt.
- annot.sqn: Submission-ready annotated sequence(s). To submit the genome and the annotation to GenBank, please go to the genome submission portal.
- annot_cds_from_genomic.fna: Nucleotide sequences in FASTA format of all CDS features annotated on the assembly, based on the genome sequence. Note: Pseudogenes annotated with CDS features are included, and may be disrupted by frameshifting indels or in-frame stop codons. Pseudogene features can be identified and screened out based on the presence of a `[pseudo=true]` qualifier in the defline.
- annot_translated_cds.faa: Protein sequences in FASTA format of CDS features annotated on the genomic records. The sequences are the conceptual translation of the nucleotide sequences provided in the annot_cds_from_genomic.fna file.
- annot_with_genomic_fasta.gff: Annotation in GFF format followed by the `## FASTA` pragma and the genomic sequence(s) in FASTA format.
- calls.tab: Coordinates of detected foreign sequence in tab-delimited format. Columns are: sequence identifier, whether the sequence is partially (M) or entirely (X) made of foreign sequence, range of foreign sequence, apparent source, source category. Note: this file is only present in the output if foreign spans were detected.
- checkm.txt: Annotated assembly completeness and contamination as calculated by CheckM. See a full description of the file format at this location.
  Note: 1) The CheckM calculation is performed on the proteins produced by PGAP, 2) the set of markers used by CheckM is determined by the species associated with the genome as provided by the input YAML file, or as returned by the taxonomy check if using `--taxcheck --auto-correct-tax`.
- cwltool.log: Execution log
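
The note on annot_cds_from_genomic.fna says that pseudogene CDS features can be screened out by looking for the `[pseudo=true]` qualifier in the defline. A minimal sketch of that screening in Python follows. The helper names `read_fasta` and `drop_pseudogenes` are hypothetical, and the example records are made up for illustration; only the `[pseudo=true]` marker comes from the description above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    defline, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None and line:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def drop_pseudogenes(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only records whose defline lacks the [pseudo=true] qualifier."""
    return [(d, s) for d, s in records if "[pseudo=true]" not in d]


# Made-up records; real deflines will carry different identifiers and qualifiers.
example = [
    ">example_cds_1 [locus_tag=tmp_000001]",
    "ATGAAA",
    "TAA",
    ">example_cds_2 [locus_tag=tmp_000002] [pseudo=true]",
    "ATGTAGAAATAA",
]

kept = drop_pseudogenes(read_fasta(example))
```

To apply this to a real run, pass an open file handle for annot_cds_from_genomic.fna to `read_fasta` in place of the example list.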
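
The five columns of calls.tab listed above can be loaded into named fields for easier filtering. The sketch below is only an illustration under stated assumptions: the `ForeignCall` type and `parse_calls` function are hypothetical names, the range column is kept as an unparsed string because its exact format is not described here, and skipping lines that start with `#` or have fewer than five columns is an assumption about possible header or comment lines.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class ForeignCall(NamedTuple):
    seq_id: str           # sequence identifier
    extent: str           # "M" = partially foreign, "X" = entirely foreign
    span: str             # range of foreign sequence, left unparsed
    apparent_source: str
    source_category: str


def parse_calls(handle: Iterable[str]) -> List[ForeignCall]:
    """Read tab-delimited rows into ForeignCall records."""
    calls = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Assumption: blank, '#'-prefixed, or short rows are not data rows.
        if not row or row[0].startswith("#") or len(row) < 5:
            continue
        calls.append(ForeignCall(*row[:5]))
    return calls


# Made-up rows for illustration; values do not come from a real run.
example = io.StringIO(
    "contig_1\tM\t100..250\texample source A\texample category A\n"
    "contig_7\tX\t1..900\texample source B\texample category B\n"
)

calls = parse_calls(example)
entirely_foreign = [c.seq_id for c in calls if c.extent == "X"]
```

Because calls.tab is only written when foreign spans are detected, a caller would likely check that the file exists before opening it.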
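
Since annot_with_genomic_fasta.gff holds the annotation followed by the FASTA pragma and the genomic sequence(s), some tools may need the two parts separated. The following is a rough sketch, not part of PGAP: `split_gff_and_fasta` is a hypothetical helper, the sample content is made up, and accepting the pragma both with and without a space after `##` is an assumption made to be tolerant of either spelling.

```python
from typing import List, Tuple


def split_gff_and_fasta(text: str) -> Tuple[List[str], List[str]]:
    """Split combined GFF+FASTA text at the first FASTA pragma line."""
    gff_lines: List[str] = []
    fasta_lines: List[str] = []
    in_fasta = False
    for line in text.splitlines():
        # Accept "##FASTA" as well as "## FASTA" (assumption, see lead-in).
        is_pragma = line.startswith("##") and line[2:].strip() == "FASTA"
        if not in_fasta and is_pragma:
            in_fasta = True
            continue  # the pragma line itself belongs to neither part
        (fasta_lines if in_fasta else gff_lines).append(line)
    return gff_lines, fasta_lines


# Made-up content for illustration only.
sample = "\n".join([
    "##gff-version 3",
    "contig_1\tExample\tgene\t1\t9\t.\t+\t.\tID=gene-tmp_000001",
    "##FASTA",
    ">contig_1",
    "ATGAAATAA",
])

gff_part, fasta_part = split_gff_and_fasta(sample)
```

If the file contains no pragma line, everything is returned in the first list and the second list is empty.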