Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Funannotate train crash at Pasa step with iso-seq + RNA-seq #326

Closed
eyalbenda opened this issue Sep 23, 2019 · 25 comments
Closed

Funannotate train crash at Pasa step with iso-seq + RNA-seq #326

eyalbenda opened this issue Sep 23, 2019 · 25 comments

Comments

@eyalbenda
Copy link

I've installed the latest funannotate docker image (changed manually the path to 1.6.0 in the installation instructions).
Am running the following command to assemble a genome using pacbio iso-seq and single end RNA seq as evidence:

funannotate train -i Dba.dovetail.masked.fasta -s *.fastq.gz --pacbio_isoseq DbaIsoSeq.fasta --species "Dunaliella" --cpus 2 --max_intronlen 30000 -o out

However, I'm getting a crash in Pasa with the following error:

Traceback (most recent call last):
File "/home/linuxbrew/funannotate/bin/funannotate-train.py", line 967, in
runPASAtrain(genome, trinity_transcripts, cleanTranscripts, os.path.abspath(trinityGFF3), stringtieGTF, args.stranded, args.max_intronlen, args.cpus, organism_name, PASA_tmp)
File "/home/linuxbrew/funannotate/bin/funannotate-train.py", line 405, in runPASAtrain
with open(os.path.join(folder, pasaDBname+'.pasa_assemblies_described.txt'), 'rU') as description:
IOError: [Errno 2] No such file or directory: 'out/training/pasa/Dunaliella_bardawil.pasa_assemblies_described.txt'

@nextgenusfs
Copy link
Owner

Will be helpful to post the erroring part of the logfiles for train and for PASA.

@eyalbenda
Copy link
Author

Of course, see attachment.
funannotate-train.log

@nextgenusfs
Copy link
Owner

Looks like maybe transcripts didn't get indexed, i.e.

Thread 2 terminated abnormally: Error, no seek pos for acc: ScySeS1_10 at /home/linuxbrew/PASApipeline/PerlLib/Fasta_retriever.pm line 100 thread 2.
	Fasta_retriever::get_seq(Fasta_retriever=HASH(0x55bded17ad50), "ScySeS1_10") called at /home/linuxbrew/PASApipeline/scripts/assemble_clusters.dbi line 166 thread 2
	main::assemble_transcripts_on_scaffold("ScySeS1_10") called at /home/linuxbrew/PASApipeline/scripts/assemble_clusters.dbi line 135 thread 2
	eval {...} called at /home/linuxbrew/PASApipeline/scripts/assemble_clusters.dbi line 135 thread 2

Is this only a problem when you use Illumina and PacBio?

@eyalbenda
Copy link
Author

I deleted the pasa directory and reran funannotate train, this time removing the iso-seq reads (don't know if that's enough, if not I can rerun the whole thing overnight):

funannotate train -i Dba.dovetail.masked.fasta -s *.fastq.gz --species "Dunaliella bardawil" --cpus 2 --max_intronlen 30000 -o out

The error file was too big, so I sent you a google-drive link by email

@nextgenusfs
Copy link
Owner

nextgenusfs commented Sep 23, 2019

Okay thanks. Will try to take a look after work. And one other follow-up, the RNA-seq tests run without error on your system? ie

funannotate test -t rna-seq --cpus 2

@eyalbenda
Copy link
Author

eyalbenda commented Sep 23, 2019 via email

@nextgenusfs
Copy link
Owner

The error in the log file seems to say an alignment exists but the sequence doesn't. This is probably due to re-running the command with different input, ie funannotate will re-use any data that is available, so if you ran with Illumina + PacBio (which failed) and then re-ran with only Illumina, it will still re-use the alignments if they are present.

Lets see if the only illumina works. The other thing to try would be just the PacBio -- it will help me to fix if I know that error only happens when you pass both RNA-seq data.

@nextgenusfs
Copy link
Owner

Any updates here? Did you find a solution that worked? I'm recalling something about Pacbio fasta headers that might causing some problems with PASA.

@eyalbenda
Copy link
Author

Thanks Jon, I ended up just using the Illumina data and dropping the iso-seq, since that seemed to work.
Please let me know if I should close the issue, or if there is anything I can try on my end, I'd be happy to help!

@nextgenusfs
Copy link
Owner

I'd like to figure out what was causing the error -- any chance you'd be able to sub sample the data to a smaller reproducible dataset that still causes the error? That is of course if you'd be willing to share this smaller test set with me confidentially so I can determine what the error is?

@eyalbenda
Copy link
Author

eyalbenda commented Nov 8, 2019 via email

@nextgenusfs
Copy link
Owner

I'll leave this open for now -- but would still like to hear if still an issue with the latest release. thanks!

@eyalbenda
Copy link
Author

eyalbenda commented Dec 16, 2019

Sorry for the late response. I asked but my collaborators are uncomfortable with sharing any data for debugging. I don't agree with their reluctance but it is what it is.
I tried annotating again and funannotate train crashes in assembling pasa clusters. This is the relevant logfiles - I tried manually the command from the pasa log and I'm parsing below the error output.

pasa /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/scripts/assembly_db_loader.dbi -M '/home/multivac/NanoSystem/Dunaliella/D.bardawil/fun/training/pasa/Dunaliella_bardawil' > pasa_run.log.dir/alignment_assembly_loading.out
-connecting to SQLite db: /home/multivac/NanoSystem/Dunaliella/D.bardawil/fun/training/pasa/Dunaliella_bardawil
WARNING: previous assemblies have been loaded...  Purging them first before attempting a re-load
-cleaning previous asmbls from alignment table
-cleaning previous asmbls from align_link table
-cleaning previous asmbls from cdna_info table
-purging asmbl_link
done cleaning.

Now, on to re-loading.

Committed 10000 PASA assemblies       DBD::SQLite::db do failed: UNIQUE constraint failed: asmbl_link.asmbl_acc, asmbl_link.cdna_acc at /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/PerlLib/DB_connect.pm line 221.
failed query: <insert into asmbl_link (asmbl_acc, cdna_acc) values (?,?)>       values: asmbl_10356 transcript
Errors: UNIQUE constraint failed: asmbl_link.asmbl_acc, asmbl_link.cdna_acc
 at /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/PerlLib/DB_connect.pm line 233.
        DB_connect::RunMod(DB_connect=HASH(0x559c56423ff8), "insert into asmbl_link (asmbl_acc, cdna_acc) values (?,?)", "asmbl_10356", "transcript") called at /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/scripts/assembly_db_loader.dbi line 153
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::SQLite::db handle database=/home/multivac/NanoSystem/Dunaliella/D.bardawil/fun/training/pasa/Dunaliella_bardawil;host=localhost.

@eyalbenda
Copy link
Author

Another small issue, it's possible that because the output of the Isoseq QC pipeline is fasta, and because funannotate train is importing it as if it were fastq, I lose half the reads (at least it reports half the reads in the log files).

@nextgenusfs
Copy link
Owner

Looks like the same problem -- what does a grep of the sequence headers for asmbl_10356 result in? Looks like it is parsing the headers as asmbl_10356 is the asmbl_acc and transcript as the cdna_acc. So likely the transcript is causing the problem.

@eyalbenda
Copy link
Author

Thanks for the help. I've run it again without the Isoseq data but when I finish I'll go back and try to debug it looking at that transcript.

@eyalbenda
Copy link
Author

This is the output of alignment_assembly_loading.out for this cluster. I couldn't find any other reference to asmbl_16018 (the number has changed since I had to rerun it):

// assembly asmbl_16018 in cluster: 3358
blat.proc26359.chain_6280 transcript 2537 blat.proc26359.chain_6281 transcript 3505
orient(a-/s-) align: 7330984(1635)-7331986(633)>CT....AC<7332460(632)-7332505(587)>CT....AC<7332749(586)-7332816(519)>CT....AC<7333239(518)-7333293(464)>CT....AC<7333740(463)-7333815(388)>CT....AC<7333975(387)-7334038(324)>CT....AC<7334693(323)-7334747(269)>CT....AC<7335182(268)-7335260(190)>CT....AC<7337081(189)-7337136(134)>CT....AC<7337275(133)-7337407(1)

@eyalbenda
Copy link
Author

Don't know if this helps, but searching for these chains in the blat gff yields the following lines:

chr05 BLAT cDNA_match 7337275 7337407 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 1 133 +
chr05 BLAT cDNA_match 7337081 7337136 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 134 189 +
chr05 BLAT cDNA_match 7335182 7335260 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 190 268 +
chr05 BLAT cDNA_match 7334693 7334747 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 269 323 +
chr05 BLAT cDNA_match 7333975 7334038 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 324 387 +
chr05 BLAT cDNA_match 7333740 7333815 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 388 463 +
chr05 BLAT cDNA_match 7333239 7333293 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 464 518 +
chr05 BLAT cDNA_match 7332749 7332816 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 519 586 +
chr05 BLAT cDNA_match 7332460 7332505 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 587 632 +
chr05 BLAT cDNA_match 7330984 7331986 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 633 1635 +
chr05 BLAT cDNA_match 7337275 7337389 99.13 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 1 114 +
chr05 BLAT cDNA_match 7337081 7337136 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 115 170 +
chr05 BLAT cDNA_match 7335182 7335260 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 171 249 +
chr05 BLAT cDNA_match 7334693 7334747 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 250 304 +
chr05 BLAT cDNA_match 7333975 7334038 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 305 368 +
chr05 BLAT cDNA_match 7333740 7333815 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 369 444 +
chr05 BLAT cDNA_match 7333239 7333293 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 445 499 +
chr05 BLAT cDNA_match 7332749 7332816 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 500 567 +
chr05 BLAT cDNA_match 7332460 7332505 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 568 613 +
chr05 BLAT cDNA_match 7331145 7331986 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 614 1455 +

transcript/3505 (from __all_transcripts fasta):

transcript/3505
ATTATTTCCTTTTCAATATCTTTTCTGTTAGGTTTACATCCAGAACCACAACACGGCAAG
ATGACGCAGCAACCAGCAAACGAAATGGCCATGGACCGCAACGCCTACACCAAGGCCATG
TTTGGAGCTGACAAAGCCTTTGGTACACTTGTGGCTGCGGACAAGAAGCGGGTTGCAGAC
AAGATGGTGGAGGACGAGATCAAGTCTGCTGGCTCCCTTGTCCCAGGATGTGAAGCTGCA
CTCGCCAAGGCGGCAGCGGATGCCAGCTTGGAGGCCAAGGAAAGGGACGAAAACTTTGAC
AGGGATGACTACTCCAAGCGGATTTTTGGACTGCCCTTCAAGCAGCTGTCTGCTGAGCAC
AAGGAGCGTGTGGTCAATCAGATGATGATGGACGAGATGCATGCTGGTGGCCACGTGATT
TCAGAGGAGACAGCGCAAGCAAAGGCCATAGCTGATGCCGAATACAAAGCGAGAAACATG
CACGCGCCGCCTCCTTCAGGTCATTTCATGGTTGGTGGGCATGATCTCCGGAGCCCTGCA
ACTCCTGGAGGAACAGCTGAAGCCAAGGGGGCTTCTGAGGAGGAGAAGAACAAGCAAAGC
ACACCGCAAGCAGGGCACATGTTTGTCGCTGGGCACGATCTCAGGACCCCAGGCTCAGGA
GGCGCAGGTGTCAGACCTGACGAGCTGAACGAAAAAGAGATGAATAAAGCTTTCTAGATC
TGGCTTGTTGGTAAATTGGTGAAGGGGTGCACCTTTGCCTGCGCTATGTGCTTTGGCTAA
CATGTGGTGAGGGGTCTGGCGTTTACGAGGAAGGCCAAGAAGGACAGCCTTCAATCTGTG
CATTCTTGATTCACTTCTTGCATGTTTGCATTCTGAGTGCACAATTTCAGCACTCTTGTA
CATTCAGTCAATACATGCTTTGGCTAGCAGGTGGAAGTATGCAACCAGTGCACTTGAAGA
CAAATTCCAGTTAAGCAATTTGACGTACAAGGAAATCCAAAGAGAAGAGCCTTTGAATTT
TGCACTCTTGATTTACTTCTTGCATGTTTGCATTTCATGCAAATTTGGATCACCCTCTTG
TATATGACTTTGGTATGTGCTTTGGCTAGCAGGTGGGAGGATGCAAGGAATGAGTGTGTT
TACGGTCCCACTGCCTTGTTGAGAGGAATAGGCCATCAGCGGCTGCCATTGGTCCCACTG
CCTTGTTGAGAGGAATAGGCCATCAGCGGCTGCCATTGGGTGTGGAAATCAATAATTTTC
TGTGGACGCATCAGGGCCATTTTTGTAGAGGTGCAAGGCAACCAGTATACTTGAAGACAT
GTACCAAATAAGCAAGCATTTGATCCTTTTACAGATCTGAAACAAGTCTAAGGGCTTCTG
CATTTTTACTTCTTTCATTTTGCATCTCAAACGCAAAATCCCAGCAACCTCTTCAACATT
AAGTCAATGTTTGGC

transcript/2537 is the following:

transcript/2537
AAAGATAATTGAACAGGTATTATTTCCTTTTCAATATTCTTTTCTGTTAGGTTTACATCC
AGAACCACAACACGGCAAGATGACGCAGCAACCAGCAAACGAAATGGCCATGGACCGCAA
CGCCTACACCAAGGCCATGTTTGGAGCTGACAAAGCCTTTGGTACACTTGTGGCTGCGGA
CAAGAAGCGGGTTGCAGACAAGATGGTGGAGGACGAGATCAAGTCTGCTGGCTCCCTTGT
CCCAGGATGTGAAGCTGCACTCGCCAAGGCGGCAGCGGATGCCAGCTTGGAGGCCAAGGA
AAGGGACGAAAACTTTGACAGGGATGACTACTCCAAGCGGATTTTTGGACTGCCCTTCAA
GCAGCTGTCTGCTGAGCACAAGGAGCGTGTGGTCAATCAGATGATGATGGACGAGATGCA
TGCTGGTGGCCACGTGATTTCAGAGGAGACAGCGCAAGCAAAGGCCATAGCTGATGCCGA
ATACAAAGCGAGAAACATGCACGCGCCGCCTCCTTCAGGTCATTTCATGGTTGGTGGGCA
TGATCTCCGGAGCCCTGCAACTCCTGGAGGAACAGCTGAAGCCAAGGGGGCTTCTGAGGA
GGAGAAGAACAAGCAAAGCACACCGCAAGCAGGGCACATGTTTGTCGCTGGGCACGATCT
CAGGACCCCAGGCTCAGGAGGCGCAGGTGTCAGACCTGACGAGCTGAACGAAAAAGAGAT
GAATAAAGCTTTCTAGATCTGGCTTGTTGGTAAATTGGTGAAGGGGTGCACCTTTGCCTG
CGCTATGTGCTTTGGCTAACATGTGGTGAGGGGTCTGGCGTTTACGAGGAAGGCCAAGAA
GGACAGCCTTCAATCTGTGCATTCTTGATTCACTTCTTGCATGTTTGCATTCTGAGTGCA
CAATTTCAGCACTCTTGTACATTCAGTCAATACATGCTTTGGCTAGCAGGTGGAAGTATG
CAACCAGTGCACTTGAAGACAAATTCCAGTTAAGCAATTTGACGTACAAGGAAATCCAAA
GAGAAGAGCCTTTGAATTTTGCACTCTTGATTTACTTCTTGCATGTTTGCATTTCATGCA
AATTTGGATCACCCTCTTGTATATGACTTTGGTATGTGCTTTGGCTAGCAGGTGGGAGGA
TGCAAGGAATGAGTGTGTTTACGGTCCCACTGCCTTGTTGAGAGGAATAGGCCATCAGCG
GCTGCCATTGGTCCCACTGCCTTGTTGAGAGGAATAGGCCATCAGCGGCTGCCATTGGGT
GTGGAAATCAATAATTTTCTGTGGACGCATCAGGGCCATTTTTGTAGAGGTGCAAGGCAA
CCAGTATACTTGAAGACATGTACCAAATAAGCAAGCATTTGATCCTTTTACAGATCTGAA
ACAAGTCTAAGGGCTTCTGCATTTTTACTTCTTTCATTTTGCATCTCAAACGCAAAATCC
CAGCAACCTCTTCAACATTAAGTCAATGTTTGGCACTTGCGCGATGGATCAAGCTCCCAT
GACATGAGCGGCGAAGCCAGCATCCAATTACAAGGAAGCAGGATGTGTGGTGTTTCAATT
CAGATTGGCAATTCAGTTTGAAAGAGCTGAAGCACCTCATGCTCTTTCAAACAACCTGTA
ACAGTGGAAAATGAG

@nextgenusfs
Copy link
Owner

Are there forward slashes in the fasta header?? ie transcript/3505? That could certainly be causing a problem somewhere.

@eyalbenda
Copy link
Author

yes, based on __all_transcripts fasta there are 97 transcripts with a forward slash in them. Running the training pipeline without the iso-seq data doesn't result in any transcripts with that naming convention.

@nextgenusfs
Copy link
Owner

I would just rename the isoseq data with something simple and it will likely work.

@eyalbenda
Copy link
Author

eyalbenda commented Dec 24, 2019 via email

@nextgenusfs
Copy link
Owner

Yeah if that fixes it then I can just add a fasta header rename step. There is one already but it must not be working properly.

@eyalbenda
Copy link
Author

After replacing the forward slashes with underscores in the Iso-seq fastq file, funannotate train completes successfully. Thank you for your support, closing the issue.

@nextgenusfs
Copy link
Owner

Okay, added this via 1ee07cb. Also updated the parser to something simpler, so should be a little faster in processing the long reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants