write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff #32

jonathancrabtree · 2015-03-23T19:50:47Z

Running write_fasta_from_gff.py on the output of convert_metagenemark_gff_to_gff3.py, I observed some large discrepancies between the number of CDS features in the GFF3 file and the number of CDS features written by write_fasta_from_gff.py In one case the GFF3 file contained 16042 CDS features, but the FASTA output contained only 10299 sequences, a loss of 5743 CDS sequences, ~36% of the total.

jonathancrabtree · 2015-03-23T20:02:44Z

Debugging revealed that the problem lies with get_gff3_features in biocodegff.py. There are two problems with this function:

It interprets any GFF line starting with "##FASTA" as the beginning of the GFF3 FASTA section. I don't think this is correct, at least based on my reading of the GFF3 spec. I'd say that "##FASTA" (possibly with trailing whitespace) is the only thing that should be allowed on the line.
It fails to raise an error when it sees (what it thinks is) FASTA sequence data that is not preceded by a valid FASTA defline. This is actually the more serious of the two errors, and is the one that allows thousands of input genes to be silently dropped.

The problem in this particular case is that MetaGeneMark prints the protein sequences in comments in the GFF, like this:

##Protein 11140
##MAQFRVDSEQIQQAAAAVGTSVSAIRDAVNGMYTNLQQLQSVWTGSAATQFASTAQQWRA 
##AQQQMEQSLEAIQQAMQHASGVYLDAEAQATSLFGMG
##end-Protein

convert_metagenemark_gff_to_gff3.py echoes these comments to the GFF3 file unchanged, which it should not do, because..."FASTA" is a valid amino acid sequence. So all it takes is for "FASTA" to appear at the beginning of any protein line and biocodegff.py will ignore the entire rest of the file, without printing any errors or warnings:

##Protein 10298
##MRMQKVQKKLSETSFQDRLDFAATHSKTSVLRMCNSQCTGLCARDVLRARARFGSNALER
##KKQNSLASRLVQAFINPFSCILFVLALISCINDMVLPSLSLLGQSPDDFDCTTFTIITTM 
##ITVSGILRFVQESKSANAAQKLMDMVRTTVSCLRDGDADEDAVSPSTSATASPSASASLA
##NFSFEDKAKLTEIQLDSLVVGDIVYLSTGDIVPADVRILSACDLFVNEASLTGESELVEK
##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR
<everything from this point on, except FASTA sequence, is ignored by the parser>
##DINSLSWVLIRFMIVMVPVVLAINGFTKGDW
##end-Protein

I have a partial proposed solution, which consists of:

Requiring that the "##FASTA" be the only thing on the line (although note that this doesn't solve the MetaGeneMark bug, just makes it less likely to be triggered.)
Throwing an exception when what the parser thinks is the FASTA section is not well-formed. In particular, all FASTA sequence must be preceded by a FASTA defline.

Finally, I'll file this as a separate issue, but I think convert_metagenemark_gff_to_gff3.pl should be modified so that it doesn't produce GFF3 that's ambiguous/malformed i.e., no "##FASTA" lines unless it's at the start of a valid FASTA section. One possibility might be to tack an extra "#" on all the echoed comment lines.

jonathancrabtree · 2015-03-23T20:07:51Z

Closed this prematurely!

Fixes for parsing of GFF3 ##FASTA section. Closes #32.

jonathancrabtree closed this as completed Mar 23, 2015

jonathancrabtree reopened this Mar 23, 2015

jonathancrabtree mentioned this issue Mar 23, 2015

convert_metagenemark_gff_to_gff3.py produces invalid GFF3 #33

Closed

jonathancrabtree mentioned this issue Mar 23, 2015

Fixes for parsing of GFF3 ##FASTA section. Closes #32. #34

Merged

jorvis closed this as completed in 67a6e2e Mar 23, 2015

jorvis added a commit that referenced this issue Mar 23, 2015

Merge pull request #34 from jonathancrabtree/master

d2488e9

Fixes for parsing of GFF3 ##FASTA section. Closes #32.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff #32

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff #32

jonathancrabtree commented Mar 23, 2015

jonathancrabtree commented Mar 23, 2015

jonathancrabtree commented Mar 23, 2015

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff #32

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff #32

Comments

jonathancrabtree commented Mar 23, 2015

jonathancrabtree commented Mar 23, 2015

jonathancrabtree commented Mar 23, 2015