Missing FinalTrainingModels missing #591

Closed
fbemm opened this issue May 6, 2021 · 12 comments

@fbemm

fbemm commented May 6, 2021

Hi, I am running the following:

  1. funannotate train w/ RNA-seq
  2. funannotate predict w/ pre-trained Augustus & GlimmerHMM/SNAP trained by PASA

What happens:

FinalTrainingModels (i.e. final_training_models.gff3) is never found. Once provided with a PASA GFF, GlimmerHMM/SNAP work.

To me it looks like the FinalTrainingModels are never created from the PASA results if Augustus runs pre-trained.

Is this still related to #427?

Bests!

@fbemm
Author

fbemm commented May 6, 2021

Files present:

pasa_predictions.gff3

Files missing:

pasa.training.tmp.average_gene_length.out
pasa.training.tmp.c.gtf
pasa.training.tmp.f.bad.gtf
pasa.training.tmp.f.good.gtf
pasa.training.tmp.gtf
final_training_models.gff3
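
For anyone wanting to check the same thing on their own run, here is a minimal sketch that reports which of the files named above are absent or empty. The file names are copied from this issue; the helper name `missing_training_files` and the default directory are assumptions for illustration and are not part of funannotate.

```python
from pathlib import Path

# File names as listed in this issue. The helper below is hypothetical
# and is not part of funannotate itself.
EXPECTED = [
    "pasa_predictions.gff3",
    "pasa.training.tmp.average_gene_length.out",
    "pasa.training.tmp.c.gtf",
    "pasa.training.tmp.f.bad.gtf",
    "pasa.training.tmp.f.good.gtf",
    "pasa.training.tmp.gtf",
    "final_training_models.gff3",
]


def missing_training_files(misc_dir, expected=EXPECTED):
    """Return the expected file names that are absent or empty in misc_dir."""
    misc = Path(misc_dir)
    missing = []
    for name in expected:
        path = misc / name
        # An empty file is treated like a missing one, since an empty
        # training set is what makes the downstream trainers fail.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_training_files("train/predict_misc")` (the directory used in this thread) would list whatever is missing for that run.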

@nextgenusfs
Owner

Okay, what version of funannotate? I thought #427 was fixed, but I'm going to need more to go on than just this in order to reproduce and/or find the bug. It would help to see the commands you've run.

Can you confirm that BUSCO does not get run?

@fbemm
Author

fbemm commented May 7, 2021

Sorry for the missing details!

  • BUSCO never runs
  • Train completes successfully and predict also recognizes all of train's output files
  • Version is 1.8.7

Train command was:

funannotate train -i genome_renamed.fasta -o train --left_norm norm_R1.fq --stranded FR --right_norm norm_R2.fq --max_intronlen 27000 --cpus 64

Predict command was:

funannotate predict -i genome_renamed.fasta -o train -s plant --busco_db eukaryota --organism other --repeats2evm --cpus 64 --max_intronlen 27000

@fbemm
Author

fbemm commented May 7, 2021

Train output looks complete:

err_seqcl_trinity.fasta.log
funannotate_train.coordSorted.bam
funannotate_train.pasa.gff3
funannotate_train.stringtie.gtf
funannotate_train.stringtie.gtf.fasta
funannotate_train.stringtie.gtf.gff3
funannotate_train.transcripts.gff3
funannotate_train.trinity-GG.fasta
genome.fasta
genome.fasta.cidx
genome.fasta.fai
getBestModel
hisat2.coordSorted.bam
hisat2.genome.1.ht2
hisat2.genome.2.ht2
hisat2.genome.3.ht2
hisat2.genome.4.ht2
hisat2.genome.5.ht2
hisat2.genome.6.ht2
hisat2.genome.7.ht2
hisat2.genome.8.ht2
kallisto.tsv
normalize
outparts_cln.sort
pasa
pasa.step1.gff3
seqcl_trinity.fasta.log
transcript.alignments.bam
transcript.alignments.gff3
trinity.alignments.bam
trinity.alignments.gff3
trinity.fasta
trinity.fasta.cidx
trinity.fasta.clean
trinity.fasta.clean.fai
trinity.fasta.cln
trinity_gg
Trinity-gg.log

Predict misc contains the raw PASA output:
pasa_predictions.gff3

@nextgenusfs
Owner

Okay, thanks. I think I found the bug. On a related note, I'm not sure why you would want to use pre-trained Augustus if you have RNA-seq/PASA data. Augustus is the most sensitive to training, so I typically see better performance when using the PASA dataset.

Oh, I see you updated your comment, so you must have had a pre-trained species in Augustus called 'plant'. Perhaps that is what you intended. If it's not what you wanted, then pass a full genus species name to --species and/or add --isolate or --strain, which will then create a unique name to train based on the existing data.

@fbemm
Author

fbemm commented May 7, 2021

[05/05/21 10:20:34]: OS: Ubuntu 20.04, 56 cores, ~ 395 GB RAM. Python: 3.7.8
[05/05/21 10:20:34]: Running funannotate v1.8.7
[05/05/21 10:20:34]: GeneMark not found and $GENEMARK_PATH environmental variable missing. Will skip GeneMark ab-initio prediction.
[05/05/21 10:20:35]: Found training files, will re-use these files:
  --rna_bam train/training/funannotate_train.coordSorted.bam
  --stringtie train/training/funannotate_train.stringtie.gtf
  --transcript_alignments train/training/funannotate_train.transcripts.gff3
[05/05/21 10:20:35]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/05/21 10:20:36]: Skipping CodingQuarry as --organism=other. Pass a weight larger than 0 to run CQ, ie --weights codingquarry:1
[05/05/21 10:20:36]: {'augustus': 'pretrained', 'snap': 'pasa', 'glimmerhmm': 'pasa'}
[05/05/21 10:20:36]: Parsed training data, run ab-initio gene predictors as follows:
[05/05/21 10:20:38]: perl /opt/micromamba/opt/evidencemodeler-1.1.1/EvmUtils/gff3_gene_prediction_file_validator.pl /mnt/scratch/vilperte/genome_annotation/plant/train/predict_misc/pasa_predictions.gff3
[05/05/21 10:20:41]: {'augustus': 1, 'hiq': 2, 'genemark': 0, 'pasa': 6, 'codingquarry': 0, 'snap': 1, 'glimmerhmm': 1, 'proteins': 1, 'transcripts': 1}
[05/05/21 10:22:46]: Loading genome assembly and parsing soft-masked repetitive sequences
[05/05/21 10:23:20]: Genome loaded: 309 scaffolds; 601,304,787 bp; 37.39% repeats masked
[05/05/21 10:23:27]: Parsed 123,241 transcript alignments from: train/training/funannotate_train.transcripts.gff3
[05/05/21 10:23:27]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[05/05/21 10:23:32]: Existing RNA-seq BAM hints found: train/predict_misc/hints.BAM.gff
[05/05/21 13:25:08]: join_mult_hints.pl
[05/05/21 13:25:09]: Running Augustus gene prediction using zr parameters
[05/05/21 15:18:12]: perl /opt/micromamba/opt/evidencemodeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl train/predict_misc/augustus.gff3
[05/05/21 15:18:28]: Pulling out high quality Augustus predictions
[05/05/21 15:18:31]: Found 11,710 high quality predictions from Augustus (>90% exon evidence)
[05/05/21 15:18:33]: Snap training failed, empty training set: train/predict_misc/final_training_models.gff3
[05/05/21 15:18:33]: snap failed removing from training parameters
[05/05/21 15:18:33]: GlimmerHMM training failed, empty training set: train/predict_misc/final_training_models.gff3
[05/05/21 15:18:33]: GlimmerHMM failed, removing from training parameters
[05/05/21 15:18:36]: Prediction sources: ['Augustus', 'HiQ', 'pasa']
[05/05/21 15:18:38]: Summary of gene models: {'total': 118450, 'Augustus': 72919, 'HiQ': 13714, 'pasa': 31817}
[05/05/21 15:18:38]: EVM Weights: {'Augustus': 1, 'HiQ': 2, 'pasa': 6, 'proteins': 1, 'transcripts': 1}
[05/05/21 15:18:38]: Summary of gene models passed to EVM (weights):
[05/05/21 15:18:38]: Launching EVM via funannotate-runEVM.py
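
To spot this failure quickly in a long log, one could scan for the "training failed, empty training set" lines shown above. This is only a sketch: the regular expression is modeled on the two failure lines in this log, other funannotate versions may word the message differently, and `failed_trainers` is a hypothetical helper name.

```python
import re

# Pattern modeled on the two failure lines in the log above. This is an
# assumption: other funannotate versions may phrase the message differently.
FAILED = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]: (?P<tool>\S+) training failed, "
    r"empty training set: (?P<path>\S+)$"
)


def failed_trainers(log_lines):
    """Return (tool, training_set_path) for each trainer that reported failure."""
    hits = []
    for line in log_lines:
        match = FAILED.match(line.strip())
        if match:
            hits.append((match.group("tool"), match.group("path")))
    return hits
```

Feeding it the lines of the predict log would return one entry per trainer that hit an empty training set, together with the path of the file it expected.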

@fbemm
Author

fbemm commented May 7, 2021

Yes, Augustus gets passed the "plant" model, which is a highly accurate, deeply trained & evaluated model that we use.

@nextgenusfs
Owner

You can try out the latest in master by installing with pip. You should be able to issue the same command and see if the issue is fixed. Upgrade with pip from the conda/mamba env:

python -m pip install git+https://github.com/nextgenusfs/funannotate.git --upgrade --no-deps

@fbemm
Author

fbemm commented May 7, 2021

Will do asap!

@nextgenusfs
Owner

Okay, I'd probably write the command like this then, as you'll generate GlimmerHMM/SNAP training data from these PASA models for that particular genome.

funannotate predict -i genome_renamed.fasta -o train -s "Genus species" --augustus_species plant \
    --busco_db eukaryota --organism other --repeats2evm --cpus 64 --max_intronlen 27000

@fbemm
Author

fbemm commented May 7, 2021

[May 07 04:17 PM]: Filtering PASA data for suitable training set
[May 07 04:20 PM]: 5,147 of 31,182 models pass training parameters

Seems to work. Also created now:

pasa.training.tmp.average_gene_length.out
pasa.training.tmp.c.gtf
pasa.training.tmp.f.bad.gtf
pasa.training.tmp.f.good.gtf
pasa.training.tmp.gtf

Final confirmation on Monday! Thanks for the fix.

Have a nice one ...

@nextgenusfs
Owner

Hi @fbemm, assuming this is working now? Re-open if still an issue. Will be incorporated into next release.
