Missing FinalTrainingModels missing #591

fbemm · 2021-05-06T10:22:43Z

Hi, I am running the following:

funannotate train w/ RNA-seq
funannotate predict w/ pre-trained Augustus & GlimmerHMM/SNAP trained by PASA

What happens:

FinalTrainingModels, i.e. final_training_models.gff3, is never found. Once provided with a PASA GFF, GlimmerHMM/SNAP work.

It looks to me like the FinalTrainingModels are never created from the PASA results if Augustus is running pre-trained.

Is this still related to #427?

Bests!


fbemm · 2021-05-06T10:24:27Z

Files present:

pasa_predictions.gff3

Files missing:

pasa.training.tmp.average_gene_length.out
pasa.training.tmp.c.gtf
pasa.training.tmp.f.bad.gtf
pasa.training.tmp.f.good.gtf
pasa.training.tmp.gtf
final_training_models.gff3
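
To re-check this after a rerun, here is a minimal sketch that reports which of the files listed above are absent or empty. The file names come from the listing above; the function name and the idea of passing the predict_misc directory as an argument are my own assumptions, not part of funannotate.

```python
from pathlib import Path

# File names taken from the listing above; nothing here is funannotate API.
EXPECTED = [
    "pasa_predictions.gff3",
    "pasa.training.tmp.average_gene_length.out",
    "pasa.training.tmp.c.gtf",
    "pasa.training.tmp.f.bad.gtf",
    "pasa.training.tmp.f.good.gtf",
    "pasa.training.tmp.gtf",
    "final_training_models.gff3",
]


def missing_or_empty(predict_misc):
    """Return the expected file names that are absent or zero bytes in size."""
    base = Path(predict_misc)
    return [
        name
        for name in EXPECTED
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]
```

Calling missing_or_empty("train/predict_misc") from the run directory should return an empty list once the training set is generated correctly.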

nextgenusfs · 2021-05-07T02:50:26Z

Okay, what version of funannotate? I thought #427 was fixed, but I'm going to need more to go on than just this in order to reproduce and/or find the bug. It would help to see the commands you've run.

Can you confirm that BUSCO does not get run?

fbemm · 2021-05-07T05:39:35Z

Sorry for the missing details!

BUSCO never runs
Train completes successfully and predict also recognizes all of train's output files
Version is 1.8.7

Train command was:

funannotate train -i genome_renamed.fasta -o train --left_norm norm_R1.fq --stranded FR --right_norm norm_R2.fq --max_intronlen 27000 --cpus 64

Predict command was:

funannotate predict -i genome_renamed.fasta -o train -s plant --busco_db eukaryota --organism other --repeats2evm --cpus 64 --max_intronlen 27000

fbemm · 2021-05-07T05:41:23Z

Train output looks complete:

err_seqcl_trinity.fasta.log
funannotate_train.coordSorted.bam
funannotate_train.pasa.gff3
funannotate_train.stringtie.gtf
funannotate_train.stringtie.gtf.fasta
funannotate_train.stringtie.gtf.gff3
funannotate_train.transcripts.gff3
funannotate_train.trinity-GG.fasta
genome.fasta
genome.fasta.cidx
genome.fasta.fai
getBestModel
hisat2.coordSorted.bam
hisat2.genome.1.ht2
hisat2.genome.2.ht2
hisat2.genome.3.ht2
hisat2.genome.4.ht2
hisat2.genome.5.ht2
hisat2.genome.6.ht2
hisat2.genome.7.ht2
hisat2.genome.8.ht2
kallisto.tsv
normalize
outparts_cln.sort
pasa
pasa.step1.gff3
seqcl_trinity.fasta.log
transcript.alignments.bam
transcript.alignments.gff3
trinity.alignments.bam
trinity.alignments.gff3
trinity.fasta
trinity.fasta.cidx
trinity.fasta.clean
trinity.fasta.clean.fai
trinity.fasta.cln
trinity_gg
Trinity-gg.log

predict_misc contains the raw PASA output:
pasa_predictions.gff3

nextgenusfs · 2021-05-07T05:43:37Z

Okay, thanks. I think I found the bug. On a related note, I'm not sure why you would want to use pre-trained Augustus if you have RNA-seq/PASA data. Augustus is the most sensitive to training, so I typically see better performance when using the PASA dataset.

Oh, I see you updated your comment, so you must have a pre-trained species in Augustus called 'plant'. Perhaps that is what you intended. If not, pass a full genus species name to --species and/or add --isolate or --strain, which will then create a unique name to train based on the existing data.

fbemm · 2021-05-07T05:44:45Z

[05/05/21 10:20:34]: OS: Ubuntu 20.04, 56 cores, ~ 395 GB RAM. Python: 3.7.8
[05/05/21 10:20:34]: Running funannotate v1.8.7
[05/05/21 10:20:34]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[05/05/21 10:20:35]: Found training files, will re-use these files:
  --rna_bam train/training/funannotate_train.coordSorted.bam
  --stringtie train/training/funannotate_train.stringtie.gtf
  --transcript_alignments train/training/funannotate_train.transcripts.gff3
[05/05/21 10:20:35]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/05/21 10:20:36]: Skipping CodingQuarry as --organism=other. Pass a weight larger than 0 to run CQ, ie --weights codingquarry:1
[05/05/21 10:20:36]: {'augustus': 'pretrained', 'snap': 'pasa', 'glimmerhmm': 'pasa'}
[05/05/21 10:20:36]: Parsed training data, run ab-initio gene predictors as follows:
[05/05/21 10:20:38]: perl /opt/micromamba/opt/evidencemodeler-1.1.1/EvmUtils/gff3_gene_prediction_file_validator.pl /mnt/scratch/vilperte/genome_annotation/plant/train/predict_misc/pasa_predictions.gff3
[05/05/21 10:20:41]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/05/21 10:22:46]: Loading genome assembly and parsing soft-masked repetitive sequences
[05/05/21 10:23:20]: Genome loaded: 309 scaffolds; 601,304,787 bp; 37.39% repeats masked
[05/05/21 10:23:27]: Parsed 123,241 transcript alignments from: train/training/funannotate_train.transcripts.gff3
[05/05/21 10:23:27]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[05/05/21 10:23:32]: Existing RNA-seq BAM hints found: train/predict_misc/hints.BAM.gff
[05/05/21 13:25:08]: join_mult_hints.pl
[05/05/21 13:25:09]: Running Augustus gene prediction using zr parameters
[05/05/21 15:18:12]: perl /opt/micromamba/opt/evidencemodeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl train/predict_misc/augustus.gff3
[05/05/21 15:18:28]: Pulling out high quality Augustus predictions
[05/05/21 15:18:31]: Found 11,710 high quality predictions from Augustus (>90% exon evidence)
[05/05/21 15:18:33]: Snap training failed, empty training set: train/predict_misc/final_training_models.gff3
[05/05/21 15:18:33]: snap failed removing from training parameters
[05/05/21 15:18:33]: GlimmerHMM training failed, empty training set: train/predict_misc/final_training_models.gff3
[05/05/21 15:18:33]: GlimmerHMM failed, removing from training parameters
[05/05/21 15:18:36]: Prediction sources: ['Augustus', 'HiQ', 'pasa']
[05/05/21 15:18:38]: Summary of gene models: {'total': 118450, 'Augustus': 72919, 'HiQ': 13714, 'pasa': 31817}
[05/05/21 15:18:38]: EVM Weights: {'Augustus': 1, 'HiQ': 2, 'pasa': 6, 'proteins': 1, 'transcripts': 1}
[05/05/21 15:18:38]: Summary of gene models passed to EVM (weights):
[05/05/21 15:18:38]: Launching EVM via funannotate-runEVM.py
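
For longer logs, the failed trainers can be pulled out with a small filter. This is only a sketch, assuming the "training failed, empty training set" line format shown above stays the same; the regex and function name are hypothetical and not part of funannotate.

```python
import re

# Matches lines such as:
# [05/05/21 15:18:33]: Snap training failed, empty training set: <path>
FAILED = re.compile(
    r"^\[[^\]]+\]: (?P<tool>\S+) training failed, "
    r"empty training set: (?P<path>\S+)$"
)


def failed_trainers(log_text):
    """Return (tool, training_set_path) for every failed-training log line."""
    hits = []
    for line in log_text.splitlines():
        match = FAILED.match(line.strip())
        if match:
            hits.append((match.group("tool"), match.group("path")))
    return hits
```

On the log above this would report Snap and GlimmerHMM, both pointing at train/predict_misc/final_training_models.gff3.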

fbemm · 2021-05-07T05:46:24Z

Yes, Augustus gets passed the "plant" model, which is a highly accurate, deeply trained & evaluated model that we use.

nextgenusfs · 2021-05-07T05:47:20Z

You can try out the latest in master by installing with pip; you should be able to issue the same command and see if the issue is fixed. Upgrade with pip from the conda/mamba env:

python -m pip install git+https://github.com/nextgenusfs/funannotate.git --upgrade --no-deps

fbemm · 2021-05-07T05:47:49Z

Will do asap!

nextgenusfs · 2021-05-07T05:49:45Z

Okay, I'd probably write the command like this then as you'll generate glimmer/snap training data from these PASA models for that particular genome.

funannotate predict -i genome_renamed.fasta -o train -s "Genus species" --augustus_species plant \
    --busco_db eukaryota --organism other --repeats2evm --cpus 64 --max_intronlen 27000

fbemm · 2021-05-07T15:43:39Z

[May 07 04:17 PM]: Filtering PASA data for suitable training set
[May 07 04:20 PM]: 5,147 of 31,182 models pass training parameters

Seems to work. Also created now:

pasa.training.tmp.average_gene_length.out
pasa.training.tmp.c.gtf
pasa.training.tmp.f.bad.gtf
pasa.training.tmp.f.good.gtf
pasa.training.tmp.gtf

Final confirmation on Monday! Thanks for the fix.

Have a nice one ...

nextgenusfs · 2021-05-15T16:42:33Z

Hi @fbemm, assuming this is working now? Re-open if still an issue. Will be incorporated into next release.

nextgenusfs pushed a commit that referenced this issue May 7, 2021

fix PASA training set availability for glimmer/snap/etc #591

9a66724

nextgenusfs closed this as completed May 15, 2021

wittetom mentioned this issue May 17, 2021

Cannot generate training set #597

Closed

nextgenusfs mentioned this issue Aug 30, 2021

empty training set: output/predict_misc/final_training_models.gff3 #629

Closed
