phobius fails with error at line 408 if funannotate annotate is used on FA+GFF #508
Comments
Yuck. GFF reformatting is never pretty. Did you try to pass the NCBI GFF3 file + FASTA directly into Per your fasta file above, maybe copy/paste error but there is no |
Yeah, the FASTA has a ">"; that seems to be a copy/paste miss. The EVM validity check and funannotate die on the GFF because it has no Parents for basically all of the entries, i.e. it is missing the "gene" feature and has only the "mRNA" feature. Also problematic, apparently, were odd entries like pseudogenes, miRNAs, and "cDNA matches". The file in question is this: related gbff/gtf/fna: I also tried converting the GTF (without success thus far, but I want to try AGAT on it) or the gbff, but every tool (external or funannotate util) dies on them too. Was anybody ever able to parse any NCBI GFF file? Regarding the problem above, I noticed that the original GFF contains UTRs but that's it. The stop (and start) codons are part of the CDS. That's how they end up causing problems here. I do not know a tool to remove them. AGAT offers to introduce a separate feature for start and stop, but the CDS remains identical, also if I convert to GTF format (which should not contain the stop codons). |
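To see how widespread the missing-Parent problem is before trying to repair a file, a small report per feature type can help. The following is a minimal sketch, not part of funannotate or EVM; the function names, the `TOP_LEVEL` set, and the example records are assumptions made for illustration.

```python
from collections import Counter

# Feature types assumed to be legitimately top-level (illustrative choice).
TOP_LEVEL = {"region", "gene", "pseudogene"}

def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def count_orphans(gff_lines):
    """Count, per feature type, the features that carry no Parent attribute."""
    orphans = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] in TOP_LEVEL:
            continue
        if "Parent" not in parse_attributes(cols[8]):
            orphans[cols[2]] += 1
    return orphans

# Made-up records, only to show the call.
example = [
    "##gff-version 3\n",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-a\n",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-a;Parent=gene-a\n",
    "chr1\tRefSeq\tmRNA\t1000\t1900\t.\t+\t.\tID=rna-b\n",
    "chr1\tRefSeq\tcDNA_match\t2000\t2100\t.\t+\t.\tID=match-1\n",
]
print(dict(count_orphans(example)))
```

On a real file, pass an open file handle instead of the example list.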
Okay, I think should be easy to make a little Well you weren't joking -- these annotations are a hot mess. |
I seem to have no luck removing the stop codon from those RefSeq GFFs; conversion from GTF with e.g. gffread adds them back. As I gathered, this is supposed to be the case with GFF. Perhaps, during the translation to protein, the translated stop codon (*) could be removed if present. I tried to do this using Unix tools, but if I restart funannotate annotate, it re-creates the Protein.fa from the input FA+GFF. |
The code should be removing the asterisks in the protein file as it also causes problems in interproscan. I will double check. |
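For reference, stripping a trailing asterisk from a protein FASTA could look roughly like the sketch below. It is not the actual funannotate code; `read_fasta`, `strip_terminal_stop`, and the example records are hypothetical names and data.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def strip_terminal_stop(sequence):
    """Remove translated stop symbols (*) from the end of a protein only."""
    return sequence.rstrip("*")

# Made-up records: one with a terminal stop, one with an internal stop.
example = [">prot1", "MKTAY", "LV*", ">prot2", "MK*TAY"]
cleaned = [(name, strip_terminal_stop(seq)) for name, seq in read_fasta(example)]
for name, seq in cleaned:
    print(f">{name}\n{seq}")
```

Internal asterisks are deliberately left alone here, since they point to a different problem than a translated terminal stop codon.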
If I relax the parser so that it doesn't just die if there is no Parent=, I'm still getting problems with some of these gene models, here is one that I don't think is correct -- this suggests that this gene has a protein coding transcript and then several non-coding transcripts. Funannotate encodes a gene model as a single type (tRNA, mRNA, rRNA, etc). When I'm annotating something, I would call the ncRNAs a different gene.... It could eventually be supported, but there are likely >50 areas in the code this could cause problems with.
|
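Genes that mix coding and non-coding transcripts can be listed up front by collecting the transcript types under each gene. This is a minimal sketch under assumptions: the set of transcript feature types and the example records are illustrative, and the function is not part of funannotate.

```python
from collections import defaultdict

# Transcript-level feature types to look at (illustrative, not exhaustive).
TRANSCRIPT_TYPES = {"mRNA", "lnc_RNA", "ncRNA", "tRNA", "rRNA", "transcript"}

def transcript_types_by_gene(gff_lines, transcript_types=TRANSCRIPT_TYPES):
    """Map each parent gene ID to the list of distinct transcript types under it."""
    types = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in transcript_types:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        parent = attrs.get("Parent")
        if parent is not None and cols[2] not in types[parent]:
            types[parent].append(cols[2])
    return dict(types)

# Made-up records: gene-x has a coding and a non-coding transcript.
example = [
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-1;Parent=gene-x\n",
    "chr1\tRefSeq\tlnc_RNA\t1\t700\t.\t+\t.\tID=rna-2;Parent=gene-x\n",
    "chr1\tRefSeq\tmRNA\t2000\t2900\t.\t+\t.\tID=rna-3;Parent=gene-y\n",
]
by_gene = transcript_types_by_gene(example)
mixed = {gene: kinds for gene, kinds in by_gene.items() if len(kinds) > 1}
print(mixed)
```

Storing a list per gene instead of a single value is the same shape of change as supporting several types per gene model; the sketch only reports such genes and does not resolve them.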
Thanks for the fix with the stop codon. That should already fix the immediate problem! :) :) Yeah, I am not sure what kind of annotations all these things are. Some are labelled lncRNA or similar. I thought they might be different isoforms in some cases. The problem with the Parents I managed to fix using AGAT (on GitHub): And I am using gffread to exclude non-coding and pseudo features, and grep to exclude miRNA and "cDNA matches". I'll lose some of the annotations, but the majority (protein-coding) should stay in, I guess. |
Ideally we'd be able to keep all of the features; even if funannotate can't do anything with them, it would be best if they could be passed along. One way would be to split the lines of the GFF that it can't digest off into another file; the problem then becomes incorporating them back in, since the core of funannotate is not GFF but rather NCBI tbl format. Since apparently it's okay by NCBI standards for a gene to have multiple different types, I should probably make that change -- essentially it's converting a key to a list instead of a single value, it's just that that value is referenced lots of times in the code. But changing that might allow this type of GFF to be processed directly by funannotate. |
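The idea of splitting off the lines that cannot be digested could be sketched at line level as follows. This is only an illustration: the `SUPPORTED` set is an assumption and not funannotate's real list of handled feature types, and `split_gff` is a hypothetical helper.

```python
# Feature types assumed to be digestible (illustrative, not funannotate's list).
SUPPORTED = {"gene", "mRNA", "exon", "CDS", "tRNA", "rRNA"}

def split_gff(gff_lines, supported_types=SUPPORTED):
    """Split GFF lines into (kept, leftover) by the feature type in column 3."""
    kept, leftover = [], []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)  # keep directives and comments with the main file
            continue
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] in supported_types:
            kept.append(line)
        else:
            leftover.append(line)
    return kept, leftover

# Made-up records, only to show the call.
example = [
    "##gff-version 3\n",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-a\n",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-a;Parent=gene-a\n",
    "chr1\tRefSeq\tcDNA_match\t2000\t2100\t.\t+\t.\tID=match-1\n",
    "chr1\tRefSeq\tmiRNA\t3000\t3020\t.\t+\t.\tID=rna-mir\n",
]
kept, leftover = split_gff(example)
print(len(kept), "lines kept;", len(leftover), "lines set aside")
```

A split by type alone can separate children from their parents (for example, exons of a transcript whose type is set aside), so a fuller version would have to follow the Parent links as well.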
Thanks for the fixes, it progressed a lot further (99.8%), but I am still getting errors with phobius that look similar to before. I checked and there are internal stop codons in the protein sequence. What would you suggest doing here? I think there are some GFF tools which can filter them out, or, for phobius or other annotation steps, the internal stop codons could be changed to X. |
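Both options mentioned here, finding proteins with internal stops and masking those stops with X, are easy to prototype. The sketch below uses hypothetical helper names and made-up sequences, and it is not how funannotate handles the case.

```python
def has_internal_stop(protein):
    """True if a stop symbol (*) occurs anywhere before the end of the protein."""
    return "*" in protein.rstrip("*")

def mask_internal_stops(protein):
    """Drop a terminal stop and replace internal stops with X (unknown residue)."""
    return protein.rstrip("*").replace("*", "X")

# Made-up sequences, only to show the calls.
proteins = {"prot1": "MKTAYLV*", "prot2": "MK*TAYLV*"}
flagged = [name for name, seq in proteins.items() if has_internal_stop(seq)]
masked = {name: mask_internal_stops(seq) for name, seq in proteins.items()}
print(flagged)
print(masked)
```

Masking changes the sequence that downstream tools see, so the list of flagged IDs is worth keeping next to the results.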
The input GFF was made like this (to make it EVM conform):
gffread -C --no-pseudo -g $INPUTFILENAME.fa $INPUTFILENAME.gtf -o $INPUTFILENAME.gffread.gff
cat $INPUTFILENAME.gffread.gff | awk ' $3 == "transcript" ' | cut -f2 | sort | uniq
agat_convert_sp_gxf2gxf.pl -v 0 --gff $INPUTFILENAME.tmp.gff --output $INPUTFILENAME.fixed.gff
gt gff3validator $INPUTFILENAME.fixed.gff
|
I used the -V flag in gffread to remove the CDSs containing internal stop codons and this works now. Ideally, however, I would want all of the original annotation to be passed through. The main goal is to get a valid gbk file for multiple species so that I can try using funannotate compare (but perhaps the internal stop codons would cause problems there too?) |
I'm away from computer this weekend, but yeah a CDS with internal stops in my view is an incorrect annotation. Is it possible they are derived from non coding genes that are getting changed during the gff3 manipulation hoops? |
I think some of those gene models, at least according to the description on NCBI, have transcript/RNA evidence, so perhaps in some populations (e.g. where the RNA-seq comes from) it's a valid transcript, while other populations (e.g. the one the reference sequence came from) have an internal stop codon. This is known in more detail for a few genes. It seems they are "on the way" to degenerating. |
Hi
I usually had no problems with the phobius step during annotation if I had run the predict and then the annotate steps of funannotate.
But now, in order to get a valid gbk file for existing NCBI RefSeq annotations, I tried to run only the annotate step directly with the FA and the GFF. I had to fix the GFF from NCBI, as EVM always fails to parse it (and other tools do too). I did this using gffread and AGAT (see below). Most annotate steps finished, but then it fails at phobius.
Investigating, it seems this is caused by stop codons being included in the protein FASTA sequence (see below). If I remove the * (stop codon) it runs. The GFF has start and stop codon entries for each mRNA, so I guess it's coming from there. I wonder if this is the expected format of the GFF (I am now trying to remove these codons and run again) or if a check could/should be implemented here.
phobius.pl -short rna-XM_016916622.2.fa
SEQENCE ID TM SP PREDICTION
Could not read provided fasta sequence at /usr/local/bin/phobius.pl line 408.
cat rna-XM_016916622.2.fa
My steps to format the GFF downloaded from NCBI for my insect genome:
#conda install -c bioconda agat
conda activate gff
#Inputfile (prefix same for .fa and .gff), without file extension
INPUTFILENAME="GCF_003710045.1_USU_Nmel_1.2_genomic"
#exclude "cDNA-matches", the MT scaffold and "miRNAs", as they cannot be parsed properly it seems
gffread -C --no-pseudo -o- -g $INPUTFILENAME.fa $INPUTFILENAME.gff | grep -vP 'cDNA_match|NC_001566.1|miRBase:' > $INPUTFILENAME.cleaned.gff
agat_convert_sp_gxf2gxf.pl -v 2 --gff $INPUTFILENAME.cleaned.gff > $INPUTFILENAME.cleaned.agat.out
#cut out AGAT text added to GFF
##cat $INPUTFILENAME.cleaned.agat.out | grep -n "GFF3 file parse"
LINENR=$(cat $INPUTFILENAME.cleaned.agat.out | grep -n "GFF3 file parse"|cut -d":" -f1)
TOTALLINES=$(cat $INPUTFILENAME.cleaned.agat.out | wc -l)
LINESEXTRACT=$(echo "$TOTALLINES-$LINENR" | bc -l)
#remove last info lines from AGAT and fix a Capitalization Typo for a specific gene
cat $INPUTFILENAME.cleaned.agat.out | tail -n $LINESEXTRACT | grep -v "usage: /home" | grep -v "Job done in" | sed "s/Parent=gene-Apd-3/Parent=gene-apd-3/g" > $INPUTFILENAME.fixed.gff
#check GFF
gt gff3validator $INPUTFILENAME.fixed.gff
export EVM_BASE_DIR=/opt/EVidenceModeler-1.1.1
$EVM_BASE_DIR/EvmUtils/gff3_gene_prediction_file_validator.pl $INPUTFILENAME.fixed.gff
rm -f $INPUTFILENAME.cleaned.gff $INPUTFILENAME.cleaned.agat.out