Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phobius fails with error at line 408 if funannotate annotate is used on FA+GFF #508

Closed
estolle opened this issue Oct 29, 2020 · 13 comments
Closed

Comments

@estolle
Copy link

estolle commented Oct 29, 2020

Hi

I usually had no problems with the phobius step during annotation if I had run te predict and then annotation steps of funannotate

But now, in order to get a valid gbk file for existing NCBI RefSeq annotations, I tried to directly run only the annotate step with the FA and the GFF. I had to fix the gff from ncbi as EVM always fails to parse it (and other tools too). I did this using gffread and AGAT (see below). Most annotate steps finished but then it fails at phobius.

Investigating it seems this is caused by StopCodons being included in the protein fasta sequence (see below). If I remove the * (STOP codon) it runs. The GFF has start and stop codon entries for each mRNA so I guess its coming from there. I wonder if this is the expected format of the GFF (I am now trying to remove these codons and run again) or if a check could/should be implemented here.

phobius.pl -short rna-XM_016916622.2.fa

SEQENCE ID TM SP PREDICTION
Could not read provided fasta sequence at /usr/local/bin/phobius.pl line 408.

cat rna-XM_016916622.2.fa

rna-XM_016916622.2 nbisL1-mrna-21732
MEKQDTLLQIHLERPIYEQETLNEDYHYEKPKTFDFHNALSDLKYKNWKSCFISAIPSIY
WLKNYNWKENLMSDIISGLTVAIMHIPQGMAYALLGNVPPVVGIYMAFFPVLMYFFFGTS
KHVSMGTFAVVCLMTGKTVTSYSISHNEITTPNVTTTLPDLPGEYLYTPIQVATAVTLMV
GIFQIIMYIFHLGIISTLLSDPLVNSFTTGAAVCVLISQIKDLLGLKISKQKGYFKFIFT
LIDILKEIQNTNLTAVFISLITIVGLISNNEFLKPWISKKCSIPIPIELIAVVSGTLISK
YFCFSTKYNIQVVGDIPTGLPVPTIPTFNLLHLVAMDSIAITMVSYTITISMALIFAQKL
NYKINSNQELLAMGLSNVVGSFFSCMPVSASLSRSLIQQTVGGRTQIASVISCTVLLIIL
LWIGPFFEPLPRSVLASIIIVALKGMFQQANQLIKFWKLSKCDALIWISTFLTVVIISID
IGLLTGIIISLAIILLQSVRPYICLLGYIPNTDLYLDMSRFKAAVEIPGIKIFHYCGTLN
FANINHFKSELYKLIGINPKKIIEHKIKLREKGIYINTEDSEEKQELQCIIMDMSALSYI
DSSGVITLNSVMKEFQQIDIHFYFVSCTSPIFETIRKCDLYLHGTLTFKIFATIQDAKTY
LEKELYSR*

My steps to format the GFF downloaded from NCBI for my insect genome:

#conda install -c bioconda agat

conda activate gff

Inputfile (prefix same for .fa and .gff), without file extension

INPUTFILENAME="GCF_003710045.1_USU_Nmel_1.2_genomic"

#exclude "cDNA-matches", MT scaffold "miRNAs", as it cannot be parse properly it seems
gffread -C --no-pseudo -o- -g $INPUTFILENAME.fa $INPUTFILENAME.gff | grep -vP 'cDNA_match|NC_001566.1|miRBase:' > $INPUTFILENAME.cleaned.gff
agat_convert_sp_gxf2gxf.pl -v 2 --gff $INPUTFILENAME.cleaned.gff > $INPUTFILENAME.cleaned.agat.out

#cut out AGAT text added to GFF
##cat $INPUTFILENAME.cleaned.agat.out | grep -n "GFF3 file parse"
LINENR=$(cat $INPUTFILENAME.cleaned.agat.out | grep -n "GFF3 file parse"|cut -d":" -f1)
TOTALLINES=$(cat $INPUTFILENAME.cleaned.agat.out | wc -l)
LINESEXTRACT=$(echo "$TOTALLINES-$LINENR" | bc -l)

#remove last info lines from AGAT and fix a Capitalization Typo for a specific gene
cat $INPUTFILENAME.cleaned.agat.out | tail -n $LINESEXTRACT | grep -v "usage: /home" | grep -v "Job done in" | sed "s/Parent=gene-Apd-3/Parent=gene-apd-3/g" > $INPUTFILENAME.fixed.gff

#check GFF
gt gff3validator $INPUTFILENAME.fixed.gff
export EVM_BASE_DIR=/opt/EVidenceModeler-1.1.1
$EVM_BASE_DIR/EvmUtils/gff3_gene_prediction_file_validator.pl $INPUTFILENAME.fixed.gff

rm -f $INPUTFILENAME.cleaned.gff $INPUTFILENAME.cleaned.agat.out $INPUTFILENAME.cleaned.gff

@nextgenusfs
Copy link
Owner

Yuck. GFF reformatting is never pretty. Did you try to pass the NCBI GFF3 file + FASTA directly into funannotate annotate? I've tried to improve the GFF3 parsing abilities but curious to know how/what it dies on.

Per your fasta file above, maybe copy/paste error but there is no > in the header.

@estolle
Copy link
Author

estolle commented Oct 29, 2020

Yeah the fasta has a ">", seems this a copy/paste miss.

The EVM validity check and funannotate die on the GFF because it 1.) has no Parents for basically all of the entries, i.e. it missed the "gene" field and has only the "mRNA" field. Also problematic were apparently funny entries like pseudogenes, miRNAs, and "cDNA matches". The file in question is this:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz

related gbff/gtf/fna:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gtf.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gbff.gz

I also tried converting the gtf (without success thus far - but I want to try AGAT on it) or gbff but any tool (external or funannotate util) die on them too. Was anybody ever able to parse any ncbi GFF file?

Regarding the problem above I noticed that the original GFF contains UTRs but thats it. The Stop (and Start) codon are part of the CDS. Thats how they end up making problems here. I do not know a tool to remove them. AGAT offers to introduce a separate feature for start and stop but the CDS remains identical also if I convert to gtf formats (which should not contain the stop codons).

@nextgenusfs
Copy link
Owner

nextgenusfs commented Oct 29, 2020

Okay, I think should be easy to make a little gff-sanitize function for these NCBI data as its probably prevalent enough that people will use it. Shouldn't take me too long....

Well you weren't joking -- these annotations are a hot mess.

@estolle
Copy link
Author

estolle commented Oct 30, 2020

I seem to not be lucky with removing the StopCodon from those RefSeq GFFs, conversion from GTF with e.g. gffread add them. As I gathered, this is supposed to be the case with GFF. Perhaps if, during the translation to Protein, the translated Stop Codon (*) cold be removed if present. I tried to do this using Unix Tools, but if I restart funannotate annotate, it re-creates the Protein.fa from the Input FA+GFF

@nextgenusfs
Copy link
Owner

The code should be removing the asterisks in the protein file as it also causes problems in interproscan. I will double check.

@nextgenusfs
Copy link
Owner

If I relax the parser so that it doesn't just die if there is no Parent=, I'm still getting problems with some of these gene models, here is one that I don't think is correct -- this suggests that this gene has a protein coding transcript and then several non-coding transcripts. Funannotate encodes a gene model as a single type (tRNA, mRNA, rRNA, etc). When I'm annotating something, I would call the ncRNA's a different gene.... It could eventually be supported, but there are likely >50 areas in the code this could cause problems with.

$ grep 'LOC551580' GCF_003254395.2_Amel_HAv3.1_genomic.gff 
NC_037638.1	Gnomon	gene	9273	12174	.	-	.	ID=gene-LOC551580;Dbxref=BEEBASE:GB42195,GeneID:551580;Name=LOC551580;gbkey=Gene;gene=LOC551580;gene_biotype=protein_coding
NC_037638.1	Gnomon	transcript	9274	12174	.	-	.	ID=rna-XR_001705491.2;Parent=gene-LOC551580;Dbxref=GeneID:551580,Genbank:XR_001705491.2,BEEBASE:GB42195;Name=XR_001705491.2;gbkey=misc_RNA;gene=LOC551580;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 57 samples with support for all annotated introns;product=ubiquitin-related modifier 1%2C transcript variant X3;transcript_id=XR_001705491.2
NC_037638.1	Gnomon	exon	11812	12174	.	-	.	ID=exon-XR_001705491.2-1;Parent=rna-XR_001705491.2;Dbxref=GeneID:551580,Genbank:XR_001705491.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X3;transcript_id=XR_001705491.2
NC_037638.1	Gnomon	exon	11054	11121	.	-	.	ID=exon-XR_001705491.2-2;Parent=rna-XR_001705491.2;Dbxref=GeneID:551580,Genbank:XR_001705491.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X3;transcript_id=XR_001705491.2
NC_037638.1	Gnomon	exon	10913	10994	.	-	.	ID=exon-XR_001705491.2-3;Parent=rna-XR_001705491.2;Dbxref=GeneID:551580,Genbank:XR_001705491.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X3;transcript_id=XR_001705491.2
NC_037638.1	Gnomon	exon	9779	9827	.	-	.	ID=exon-XR_001705491.2-4;Parent=rna-XR_001705491.2;Dbxref=GeneID:551580,Genbank:XR_001705491.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X3;transcript_id=XR_001705491.2
NC_037638.1	Gnomon	exon	9274	9546	.	-	.	ID=exon-XR_001705491.2-5;Parent=rna-XR_001705491.2;Dbxref=GeneID:551580,Genbank:XR_001705491.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X3;transcript_id=XR_001705491.2
NC_037638.1	Gnomon	transcript	9274	12174	.	-	.	ID=rna-XR_001705490.2;Parent=gene-LOC551580;Dbxref=GeneID:551580,Genbank:XR_001705490.2,BEEBASE:GB42195;Name=XR_001705490.2;gbkey=misc_RNA;gene=LOC551580;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 65 samples with support for all annotated introns;product=ubiquitin-related modifier 1%2C transcript variant X2;transcript_id=XR_001705490.2
NC_037638.1	Gnomon	exon	11579	12174	.	-	.	ID=exon-XR_001705490.2-1;Parent=rna-XR_001705490.2;Dbxref=GeneID:551580,Genbank:XR_001705490.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X2;transcript_id=XR_001705490.2
NC_037638.1	Gnomon	exon	11054	11121	.	-	.	ID=exon-XR_001705490.2-2;Parent=rna-XR_001705490.2;Dbxref=GeneID:551580,Genbank:XR_001705490.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X2;transcript_id=XR_001705490.2
NC_037638.1	Gnomon	exon	10913	10994	.	-	.	ID=exon-XR_001705490.2-3;Parent=rna-XR_001705490.2;Dbxref=GeneID:551580,Genbank:XR_001705490.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X2;transcript_id=XR_001705490.2
NC_037638.1	Gnomon	exon	9779	9827	.	-	.	ID=exon-XR_001705490.2-4;Parent=rna-XR_001705490.2;Dbxref=GeneID:551580,Genbank:XR_001705490.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X2;transcript_id=XR_001705490.2
NC_037638.1	Gnomon	exon	9274	9546	.	-	.	ID=exon-XR_001705490.2-5;Parent=rna-XR_001705490.2;Dbxref=GeneID:551580,Genbank:XR_001705490.2,BEEBASE:GB42195;gbkey=misc_RNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X2;transcript_id=XR_001705490.2
NC_037638.1	Gnomon	mRNA	9273	12174	.	-	.	ID=rna-XM_623972.6;Parent=gene-LOC551580;Dbxref=GeneID:551580,Genbank:XM_623972.6,BEEBASE:GB42195;Name=XM_623972.6;gbkey=mRNA;gene=LOC551580;model_evidence=Supporting evidence includes similarity to: 4 ESTs%2C 7 Proteins%2C and 95%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 66 samples with support for all annotated introns;product=ubiquitin-related modifier 1%2C transcript variant X1;transcript_id=XM_623972.6
NC_037638.1	Gnomon	exon	11201	12174	.	-	.	ID=exon-XM_623972.6-1;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XM_623972.6,BEEBASE:GB42195;gbkey=mRNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X1;transcript_id=XM_623972.6
NC_037638.1	Gnomon	exon	11054	11121	.	-	.	ID=exon-XM_623972.6-2;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XM_623972.6,BEEBASE:GB42195;gbkey=mRNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X1;transcript_id=XM_623972.6
NC_037638.1	Gnomon	exon	10913	10994	.	-	.	ID=exon-XM_623972.6-3;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XM_623972.6,BEEBASE:GB42195;gbkey=mRNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X1;transcript_id=XM_623972.6
NC_037638.1	Gnomon	exon	9779	9827	.	-	.	ID=exon-XM_623972.6-4;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XM_623972.6,BEEBASE:GB42195;gbkey=mRNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X1;transcript_id=XM_623972.6
NC_037638.1	Gnomon	exon	9273	9546	.	-	.	ID=exon-XM_623972.6-5;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XM_623972.6,BEEBASE:GB42195;gbkey=mRNA;gene=LOC551580;product=ubiquitin-related modifier 1%2C transcript variant X1;transcript_id=XM_623972.6
NC_037638.1	Gnomon	CDS	11201	11244	.	-	0	ID=cds-XP_623975.1;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XP_623975.1,BEEBASE:GB42195;Name=XP_623975.1;gbkey=CDS;gene=LOC551580;product=ubiquitin-related modifier 1;protein_id=XP_623975.1
NC_037638.1	Gnomon	CDS	11054	11121	.	-	1	ID=cds-XP_623975.1;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XP_623975.1,BEEBASE:GB42195;Name=XP_623975.1;gbkey=CDS;gene=LOC551580;product=ubiquitin-related modifier 1;protein_id=XP_623975.1
NC_037638.1	Gnomon	CDS	10913	10994	.	-	2	ID=cds-XP_623975.1;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XP_623975.1,BEEBASE:GB42195;Name=XP_623975.1;gbkey=CDS;gene=LOC551580;product=ubiquitin-related modifier 1;protein_id=XP_623975.1
NC_037638.1	Gnomon	CDS	9779	9827	.	-	1	ID=cds-XP_623975.1;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XP_623975.1,BEEBASE:GB42195;Name=XP_623975.1;gbkey=CDS;gene=LOC551580;product=ubiquitin-related modifier 1;protein_id=XP_623975.1
NC_037638.1	Gnomon	CDS	9478	9546	.	-	0	ID=cds-XP_623975.1;Parent=rna-XM_623972.6;Dbxref=GeneID:551580,Genbank:XP_623975.1,BEEBASE:GB42195;Name=XP_623975.1;gbkey=CDS;gene=LOC551580;product=ubiquitin-related modifier 1;protein_id=XP_623975.1

@estolle
Copy link
Author

estolle commented Oct 30, 2020

Thanks for the fix with the Stop Codon . That should already fix the immediate problem! :) :)

Yeah, I am not sure what kind of annotations all these things are. Some are labelled lncRNA or such. I tought they may be different isoforms in some cases.

The problem with theparents I managed to fix using AGAT (on github):
agat_convert_sp_gxf2gxf.pl

And I am using gffread to exclude noncoding and pseudo and grep to exclude miRNA and "cDNA matches". I'll loose some of the annotations but the majority (proteincoding) should stay in I guess.

@nextgenusfs
Copy link
Owner

Ideally we'd be able to keep all of the features, even if funannotate can't do anything with them, would be best if they could be passed along. One way would be just to split the lines of the GFF that it can't digest off into another file, the problem then becomes incorporating it back in, since the core of funannotate is not GFF but rather NCBI tbl format.

Since apparently its okay by NCBI standards to have a gene have multiple different types, I should probably make that change -- essentially its converting a key to a list instead of a single value, its just that that value is referenced lots of times in the code. But changing that might allow for this type of GFF to be processed directly by funannotate.

@estolle
Copy link
Author

estolle commented Oct 30, 2020

Thanks for the fixes, it progressed alot firther (99.8%), but I am still getting errors looking similar as before with phobius. I checked and there are internal stop codons in the protein sequence.
While these actually should be present, in this species alot of them come from differences between the genome sequence and the annotation for which there was transcript/protein evidence. The GFF comments describe this here and there that the transcripts look ok, but the genome sequence (of that individual) had one or more stop codons present.

What would you suggest to do here? I think there are some GFF tools which can filter them out, or, for the phobius or other annotation step the internal stop codons could be changed to X.

@estolle
Copy link
Author

estolle commented Oct 30, 2020

the input GFF was made like this (to make it EVM conform):

gffread -C --no-pseudo -g $INPUTFILENAME.fa $INPUTFILENAME.gtf -o $INPUTFILENAME.gffread.gff

cat $INPUTFILENAME.gffread.gff | awk ' $3 == "transcript" ' | cut -f2 | sort | uniq
cat $INPUTFILENAME.gffread.gff | grep -vP 'cDNA_match|NC_001566.1|miRBase:' | sed "s/Parent=gene-Apd-3/Parent=gene-apd-3/g" | sed "s/Gnomon\ttranscript/Gnomon\tmRNA/g" | sed "s/BestRefSeq\ttranscript/Gnomon\tmRNA/g" > $INPUTFILENAME.tmp.gff

agat_convert_sp_gxf2gxf.pl -v 0 --gff $INPUTFILENAME.tmp.gff --output $INPUTFILENAME.fixed.gff

gt gff3validator $INPUTFILENAME.fixed.gff
export EVM_BASE_DIR=/opt/EVidenceModeler-1.1.1
$EVM_BASE_DIR/EvmUtils/gff3_gene_prediction_file_validator.pl $INPUTFILENAME.fixed.gff

@estolle
Copy link
Author

estolle commented Oct 31, 2020

I used the -V flag in gffread to remove the CDS's containing intenral stop codons and this works now. Idealy I would however want all the original annotation to be passed through. The main goal is to get a valid gbk file for multiple species so that I can try using funannotate compare (but perhaps the intenral stop codons would cause problems there too ?)

@nextgenusfs
Copy link
Owner

I'm away from computer this weekend, but yeah a CDS with internal stops in my view is an incorrect annotation. Is it possible they are derived from non coding genes that are getting changed during the gff3 manipulation hoops?

@estolle
Copy link
Author

estolle commented Nov 24, 2020

I think some of those genemodels , at least according to the description on NCBI, have trnscript/RNA evidence, so perhaps in some populations (e.g. where the RNAseq comes from) its a valid transcript, and other populations (e.g .from which the reference sequence came from) have an internal stop codon. This is known for a few genes to more detail. Seem they are "one the way" to degenerate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants