GffRead can be used to simply read an annotation file in GFF format and print it in either GFF3 (default) or GTF2 format (with the -T option), while discarding any non-transcript features and optional attributes. It can also report some potential issues found in the input GFF records. The command line for such a quick GFF/GTF file cleanup would be:
gffread -E annotation.gff -o ann_simple.gff
This will create a minimalist GFF3 re-formatting of the transcript records found in the input file (annotation.gff in this example).
The -E option directs GffRead to "expose" (display warnings about) any potential formatting issues encountered while parsing the input file.
To obtain the GTF2 version of the same transcript records, add the -T option:
gffread annotation.gff -T -o annotation.gtf
GffRead can be used to generate a FASTA file with the DNA sequences of all transcripts in a GFF file. For this operation, a FASTA file with the genomic sequences has to be provided as well. This can be accomplished with a command line like this:
gffread -w transcripts.fa -g genome.fa annotation.gff
The file genome.fa in this example would be a multi-FASTA file with the chromosome/contig sequences of the target genome.
Every contig or chromosome name found in the first column of the input GFF file (annotation.gff in this example) must have a corresponding sequence entry in the genome.fa file.
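The name-matching requirement above can be checked before running GffRead. The following is a minimal sketch using standard Unix tools, not part of GffRead itself. The two small sample files it creates are hypothetical stand-ins for annotation.gff and genome.fa, and it assumes that the sequence name in a FASTA header is the first whitespace-delimited word after ">".

```shell
# Hypothetical sample files standing in for annotation.gff and genome.fa.
workdir=$(mktemp -d)
printf 'chr1\tsrc\tmRNA\t1\t50\t.\t+\t.\tID=t1\n' > "$workdir/annotation.gff"
printf 'chr2\tsrc\tmRNA\t5\t40\t.\t-\t.\tID=t2\n' >> "$workdir/annotation.gff"
printf '>chr1 test contig\nACGTACGT\n' > "$workdir/genome.fa"

# Sequence names used in column 1 of the GFF (comment lines skipped).
grep -v '^#' "$workdir/annotation.gff" | cut -f1 | sort -u > "$workdir/gff_ids.txt"

# Sequence names defined in the FASTA: first word of each header, without '>'.
awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$workdir/genome.fa" | sort -u > "$workdir/fa_ids.txt"

# Names present in the GFF but absent from the FASTA (comm needs sorted input).
missing=$(comm -23 "$workdir/gff_ids.txt" "$workdir/fa_ids.txt")
echo "missing from genome.fa: $missing"
```

With real data, replace the two sample files with the actual annotation and genome files. An empty result means every name in the GFF has a matching FASTA entry.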
gffread --table @id,@chr,@start,@end,@strand,@exons,Name,gene,product \
-o annotation.tbl annotation.gff
The command above shows how the --table option can produce a tab-delimited table, with the listed fields as columns, from a GFF3 input.
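A tab-delimited table like this is easy to post-process with standard tools. The sketch below is illustrative only: it builds a small hypothetical table with the same column order as the command above, and it assumes that the @exons column holds a comma-separated list of start-end ranges. Verify that assumption against the actual annotation.tbl before relying on it.

```shell
# Hypothetical two-row table with columns:
# id, chr, start, end, strand, exons, Name, gene, product
tbl=$(mktemp)
printf 't1\tchr1\t100\t900\t+\t100-200,300-450,700-900\tT1\tG1\tkinase\n' > "$tbl"
printf 't2\tchr1\t1500\t1800\t-\t1500-1800\tT2\tG2\thypothetical protein\n' >> "$tbl"

# For each transcript, print its ID, chromosome, and exon count,
# where the count is the number of comma-separated ranges in column 6.
summary=$(awk -F'\t' '{ n = split($6, ex, ","); print $1 "\t" $2 "\t" n }' "$tbl")
printf '%s\n' "$summary"
```

Setting the awk field separator to a tab keeps values that contain spaces, such as a product description, in a single column.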
The output directory contains all the output files that should be generated by the above examples.