This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Data format standards

mattb112885 edited this page Nov 7, 2013 · 9 revisions

Input file formats

Genbank files (required)

You need ONE genbank file for every organism. Concatenate the genbank files for all the contigs into a single file.

The following information is taken from the genbank files to generate a raw file and the organisms file:

  1. Organism (in the /organism="[organism name]" line )
  2. Tax ID (in a /db_xref="Taxon:[taxid]" line )
  3. Contig IDs and sequences
  4. Gene locations, annotations and sequences
  5. Optionally, gene names and locus tags (identified by a db_xref)

These should be present in all the genbank files from Genbank (ftp.ncbi.nih.gov/genomes/Bacteria). It is assumed that any duplicate taxon IDs or organism names that are present in the Genbank file are all identical.
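As a quick sanity check before running the input scripts, the organism and taxon ID qualifiers can be pulled out with plain regular expressions. This is only a minimal sketch and not how ITEP itself reads the file (ITEP relies on Biopython); the function name organism_and_taxid and the example record are illustrative assumptions, and an organism name that wraps across lines would need extra handling.

```python
import re

# Patterns for the two source-feature qualifiers described above.
ORGANISM_RE = re.compile(r'/organism="([^"]+)"')
TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"', re.IGNORECASE)


def organism_and_taxid(genbank_text):
    """Return (organism, taxid), requiring all copies in the file to agree.

    Hypothetical helper for illustration; it does not replace a real
    Genbank parser.
    """
    organisms = set(ORGANISM_RE.findall(genbank_text))
    taxids = set(TAXON_RE.findall(genbank_text))
    if len(organisms) != 1 or len(taxids) != 1:
        raise ValueError(
            "expected exactly one organism name and one taxon ID, got %r and %r"
            % (sorted(organisms), sorted(taxids))
        )
    return organisms.pop(), taxids.pop()


# Fragment of a concatenated file with two contigs sharing the same qualifiers.
example = '''
     source          1..1000
                     /organism="Escherichia coli str. K-12 substr. MG1655"
                     /db_xref="taxon:511145"
     source          1..500
                     /organism="Escherichia coli str. K-12 substr. MG1655"
                     /db_xref="taxon:511145"
'''
print(organism_and_taxid(example))
```

A file whose contigs disagree on the organism name or taxon ID raises ValueError, which mirrors the assumption stated above that all duplicates are identical.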

ITEP input scripts will only work if Biopython can successfully parse your Genbank file. This should not be a problem for most data sources (tested to work with JGI, NCBI, RAST, and PUBSEED Genbank files and the ones generated with our KBase interface).

Raw file format (automatically generated from Genbank files using convertGenbank2table.py)

Raw files are tab-delimited files containing the information needed for our analysis. They are automatically generated from Genbank files and are placed in the ${ROOTDIR}/raw folder. The format is identical to the "spreadsheet (tab delimited)" format offered by RAST on the online interface, so you can also download these files directly from RAST instead of running convertGenbank2table.py.

The columns of a raw file are as follows:

contig_id  feature_id  type  location  start  stop  strand  function  aliases  figfam  evidence_codes  nucleotide_sequence  aa_sequence

  • The feature_id for any protein-encoding gene must have the format:

    fig|#.#.peg.# (e.g. fig|83333.1.peg.1)

  • The first two numbers (83333.1) must match the organism ID for the organism containing that gene.

  • The overall feature ID must be unique for each gene.

The Type column should be "peg" for all proteins. Anything that is not a protein is ignored.

The Start/stop columns refer to the start/stop of the actual gene on the specified contig (start > stop for - strand genes).

The start/stop are 1-indexed from the beginning of the contig on which the feature is found.

Strand is + or -.

Function is the functional annotation.

nucleotide_sequence is the nucleotide sequence encoding the protein and aa_sequence is the translated amino acid sequence.

All other fields (location, aliases, figfam, evidence_codes, ...) are not used for anything by ITEP.
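The rules above can be checked row by row. The following is a minimal sketch rather than ITEP's own validation code; check_raw_line is a hypothetical helper, the example row is made up, and the check that start is not greater than stop on the + strand is an assumption inferred from the rule for - strand genes.

```python
import re

RAW_COLUMNS = [
    "contig_id", "feature_id", "type", "location", "start", "stop", "strand",
    "function", "aliases", "figfam", "evidence_codes",
    "nucleotide_sequence", "aa_sequence",
]
# fig|#.#.peg.# ; group 1 captures the organism ID part (e.g. 83333.1).
PEG_ID_RE = re.compile(r"^fig\|(\d+\.\d+)\.peg\.\d+$")


def check_raw_line(line, organism_id):
    """Return a dict for a protein row, or None for rows that are not proteins."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(RAW_COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(RAW_COLUMNS), len(fields)))
    row = dict(zip(RAW_COLUMNS, fields))
    if row["type"] != "peg":
        return None  # anything that is not a protein is ignored
    match = PEG_ID_RE.match(row["feature_id"])
    if match is None or match.group(1) != organism_id:
        raise ValueError("bad feature_id for organism %s: %s" % (organism_id, row["feature_id"]))
    start, stop = int(row["start"]), int(row["stop"])
    if row["strand"] not in ("+", "-"):
        raise ValueError("strand must be + or -")
    # Assumption: + strand genes have start <= stop, - strand genes start >= stop.
    if (row["strand"] == "+" and start > stop) or (row["strand"] == "-" and start < stop):
        raise ValueError("start/stop do not agree with the strand")
    row["start"], row["stop"] = start, stop
    return row


# Made-up example row; the unused columns are left empty.
example_line = "\t".join([
    "contigA", "fig|83333.1.peg.1", "peg", "", "1", "9", "+",
    "hypothetical protein", "", "", "", "ATGAAATAA", "MK",
])
print(check_raw_line(example_line, "83333.1")["feature_id"])
```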

Organism file format (automatically generated)

A file called "organisms" is automatically generated from the names of the Genbank files in genbank/ and from the organism field of those Genbank files. It is a two-column table with organism name in the first column and organism ID in the second column.

The organism ID matches the regular expression "\d+\.\d+" (for example, 83333.1).

Organism names can have spaces or some special characters but semicolons and quotes are not allowed. Many functions that output formats that are sensitive to special characters (SVG, Newick) will sanitize the names of organisms and/or their IDs by replacing all non-alphanumeric characters with underscores.
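The constraints on organism IDs and names, and the sanitizing step, can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the dot in the ID pattern is treated as a literal dot, and each disallowed character is assumed to be replaced one-for-one (ITEP's own functions may differ in detail).

```python
import re

# Organism ID: digits, a literal dot, digits (e.g. 83333.1).
ORGANISM_ID_RE = re.compile(r"^\d+\.\d+$")


def is_valid_organism_id(organism_id):
    """True if the ID has the <digits>.<digits> form."""
    return ORGANISM_ID_RE.match(organism_id) is not None


def is_valid_organism_name(name):
    """Semicolons and quotes are not allowed in organism names."""
    return not any(ch in name for ch in ';"\'')


def sanitize(text):
    """Replace every non-alphanumeric character with an underscore."""
    return re.sub(r"[^A-Za-z0-9]", "_", text)


print(sanitize("Escherichia coli str. K-12"))  # Escherichia_coli_str__K_12
print(sanitize("83333.1"))                     # 83333_1
```

Note that sanitizing is lossy: two different names can map to the same sanitized string, so the sanitized form is best used only for output.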

Groups file format

The groups file is automatically generated with an "all" group containing all organisms in the ITEP database. Other groups can be added manually or with the help of the addGroupByMatch.py function. It is a two-column tab-delimited table; the first column contains the group's name and the second column is a semicolon-delimited list of organisms in that group.

Organism names in the groups file must match the organism names in the organisms file exactly. You are not allowed to have multiple group names for the same group of organisms or to have the same name refer to different groups of organisms.
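When editing the groups file by hand, a small check like the one below can catch violations of these rules. It is a minimal sketch, not part of ITEP; parse_groups and the example organisms are hypothetical, and rejecting any repeated group name is slightly stricter than the rule as stated.

```python
def parse_groups(lines, known_organisms):
    """Parse groups-file lines into {group name: frozenset of organism names}."""
    known = set(known_organisms)
    groups = {}
    names_by_members = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, members_field = line.split("\t")
        members = frozenset(m for m in members_field.split(";") if m)
        unknown = members - known
        if unknown:
            raise ValueError("organisms not in the organisms file: %s" % sorted(unknown))
        if name in groups:
            # Stricter than required: any repeated name is rejected.
            raise ValueError("group name used more than once: %s" % name)
        if members in names_by_members:
            raise ValueError(
                "groups %s and %s contain the same organisms"
                % (names_by_members[members], name)
            )
        groups[name] = members
        names_by_members[members] = name
    return groups


# Made-up organisms and groups for illustration.
organisms = ["Organism A", "Organism B", "Organism C"]
lines = [
    "all\tOrganism A;Organism B;Organism C\n",
    "pair\tOrganism A;Organism B\n",
]
groups = parse_groups(lines, organisms)
print(sorted(groups["pair"]))
```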

Output file formats

Most of the ITEP scripts output tab-delimited files with various fields (see individual functions for details). The others support specific widely-used file formats:

  • Alignments: FASTA
  • Trees: Newick
  • Graphs: GML
  • Images: SVG or PNG

The formats of some of the data tables that various functions output are described below.

Blastp and Blastn results

The BLASTP and BLASTN tables use -outfmt 6 with the query and target self-bit scores appended at the end. The column order is as follows:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore query_selfbit target_selfbit

Where:

  • qseqid is the query's sequence ID.
  • sseqid is the subject's (target's) sequence ID.
  • pident is the percent identity across the HSP.
  • length is the length of the HSP.
  • mismatch is the number of mismatches in the HSP.
  • gapopen is the number of gap openings in the HSP.
  • qstart is the location within the query (amino acid position for blastp or nucleotide position for blastn) where the HSP starts.
  • qend is the location within the query where the HSP ends.
  • sstart is the location within the target where the HSP starts.
  • send is the location within the target where the HSP ends.
  • evalue is the E-value.
  • bitscore is the bit score.
  • query_selfbit is the bit score resulting from BLASTing the query gene against itself.
  • target_selfbit is the bit score from BLASTing the target gene against itself.
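A row of this table can be read into a typed record as sketched below. This is an illustrative sketch, not an ITEP function; parse_blast_line and the example row are assumptions, and the table is assumed to be tab-delimited as in standard -outfmt 6 output.

```python
from collections import namedtuple

BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "query_selfbit", "target_selfbit",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore", "query_selfbit", "target_selfbit"}

BlastHit = namedtuple("BlastHit", BLAST_COLUMNS)


def parse_blast_line(line):
    """Convert one tab-delimited row into a BlastHit with numeric fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BLAST_COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(BLAST_COLUMNS), len(fields)))
    converted = []
    for name, value in zip(BLAST_COLUMNS, fields):
        if name in INT_COLUMNS:
            converted.append(int(value))
        elif name in FLOAT_COLUMNS:
            converted.append(float(value))
        else:
            converted.append(value)
    return BlastHit(*converted)


# Made-up row for illustration (all values are invented).
example_row = "\t".join([
    "fig|83333.1.peg.1", "fig|99999.1.peg.7", "85.00", "20", "3", "0",
    "1", "20", "1", "20", "1e-05", "35.0", "44.0", "42.0",
])
hit = parse_blast_line(example_row)
print(hit.qseqid, hit.bitscore, hit.query_selfbit, hit.target_selfbit)
```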

tBLASTn results

The tBLASTn results table is heavily modified from the standard output and extended to include information about annotated genes that overlap with hits to the target contig (if the contig is found in the database). The columns are as follows:

queryid, querylen, subcontig, organism, tblaststart, tblastend, tblastlen, queryoverlappct, evalue, bitscore, hitframe, strandedString, targetgeneid, targetannotation, targetgenelen, targetoverlappct, TBLASTN_hitID

If there are no annotated genes in the target region you will get the following:

  • queryid is the ID of the query gene.
  • querylen is the length of the query gene.
  • subcontig is the subject (target) contig.
  • organism is the organism in which the target contig is found.
  • tblaststart is the nucleotide position (on the target contig) of the beginning of the HSP.
  • tblastend is the nucleotide position (on the target contig) of the end of the HSP.
  • tblastlen is the length of the HSP.
  • queryoverlappct is the percentage of the query gene that overlapped with the HSP.
  • evalue is the E-value for the hit.
  • bitscore is the bit score for the hit.
  • hitframe is the frame (1, 2 or 3).

The field strandedString is "NOGENE" if no overlapping annotated genes were found in the homologous region on the target contig, "SAMESTRAND" if an annotated gene was found on the same strand as the hit and "OTHERSTRAND" if an annotated gene was found overlapping the hit on the opposite strand.

If there ARE annotated genes in the target region then each hit is repeated once for each annotated gene and the following information about the overlapping gene is appended: targetgeneid is the gene ID for the annotated gene on the target contig, targetannotation is the annotation of that gene, targetgenelen is the length of the target gene, and targetoverlappct is the percent overlap between the target gene and the HSP (very important!).

Finally, each hit is given a unique tBLASTn ID which encodes information about the contig and the location within it. This is used in scripts to identify the neighborhood of the homologous region.
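As a small illustration of working with the strandedString field, the sketch below tallies rows by category. It assumes the table is tab-delimited and that strandedString is the twelfth column, as in the column listing above; it deliberately reads nothing past that column because the exact layout of the remaining fields for hits without overlapping genes is not spelled out here. The helper names and example rows are made up.

```python
from collections import Counter

STRANDED_INDEX = 11  # twelfth column in the listing above (0-based index 11)
VALID_CATEGORIES = {"NOGENE", "SAMESTRAND", "OTHERSTRAND"}


def stranded_category(line):
    """Return the strandedString value of one tab-delimited tBLASTn row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= STRANDED_INDEX:
        raise ValueError("row has only %d columns" % len(fields))
    category = fields[STRANDED_INDEX]
    if category not in VALID_CATEGORIES:
        raise ValueError("unexpected strandedString: %r" % category)
    return category


def tally_categories(lines):
    """Count rows per strandedString category, skipping blank lines."""
    return Counter(stranded_category(line) for line in lines if line.strip())


# Made-up rows: the first eleven fields describe the hit itself.
hit_fields = ["fig|83333.1.peg.1", "21", "contigA", "Example organism",
              "100", "162", "63", "100.0", "1e-05", "40.0", "1"]
rows = [
    "\t".join(hit_fields + ["NOGENE"]),
    "\t".join(hit_fields + ["SAMESTRAND", "fig|99999.1.peg.7",
                            "hypothetical protein", "66", "95.0"]),
    "\t".join(hit_fields + ["SAMESTRAND", "fig|99999.1.peg.8",
                            "hypothetical protein", "90", "10.0"]),
]
counts = tally_categories(rows)
print(counts["NOGENE"], counts["SAMESTRAND"], counts["OTHERSTRAND"])
```

Because each hit is repeated once per overlapping gene, these counts are counts of rows, not of distinct hits.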

"Geneinfo" tables

Some scripts (e.g. db_getGeneInformation.py and db_getClusterGeneInformation.py) return "geneinfo" tables. The former script returns the following set of columns for a set of query genes:

geneID   organism   organism_id   contig_id   start   stop   strand   strandnum   annotation   NT_seq   AA_seq

Where:

  • geneID is the gene ID.
  • organism is the organism in which the gene is found.
  • organism_id is the organism ID (matches \d+\.\d+) for that organism.
  • contig_id is the ITEP contig ID in which the gene is found.
  • start is the location in the contig of the first nucleotide of the start codon (starting from 1).
  • stop is the location in the contig of the last nucleotide of the stop codon (starting from 1).
  • strand is + or - depending on the strand of the gene.
  • strandnum is +1 if strand is + and -1 if strand is -.
  • annotation is the gene's annotated function.
  • NT_seq is the nucleotide sequence of the gene.
  • AA_seq is the translated amino acid sequence.

The latter script (db_getClusterGeneInformation.py) takes a cluster/run ID pair (or multiple pairs) as input and returns exactly the same format as described above, EXCEPT that it appends the cluster and run ID given to it as inputs as the last two columns of the output array. If you need the cluster and run IDs from your geneinfo table you should use db_getClusterGeneInformation.py and not db_getGeneInformation.py!
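The relationship between the strand, strandnum, start and stop columns can be sketched as follows. These helpers are hypothetical and are not part of ITEP; the conversion assumes, as in the raw file format, that start is greater than stop for genes on the - strand, and it uses min/max so that either ordering works.

```python
def strand_to_strandnum(strand):
    """Map the strand column (+ or -) to the strandnum column (+1 or -1)."""
    if strand == "+":
        return 1
    if strand == "-":
        return -1
    raise ValueError("strand must be + or -, got %r" % strand)


def zero_based_span(start, stop):
    """Convert 1-indexed, inclusive start/stop to a 0-based, half-open span.

    The result can be used directly as a Python slice on the contig
    sequence: contig[lo:hi] covers the gene on the forward strand.
    """
    lo, hi = min(start, stop), max(start, stop)
    return lo - 1, hi


lo, hi = zero_based_span(190, 255)   # + strand gene
print(lo, hi, hi - lo)               # 189 255 66
print(zero_based_span(9, 1))         # - strand gene, start > stop
print(strand_to_strandnum("-"))
```

For a gene on the - strand the sliced region still has to be reverse-complemented to obtain the coding sequence.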
