Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diagnosing pipeline failure #12

Open
jsoghigian opened this issue Nov 14, 2023 · 11 comments
Open

Diagnosing pipeline failure #12

jsoghigian opened this issue Nov 14, 2023 · 11 comments

Comments

@jsoghigian
Copy link

Hi there!

Thanks again for the great tool. I've been trying it on my personal data and I think it is finishing the OrthoFinder step now. I am not sure exactly where it fails, but I think it may be failing to complete aligning single copy genes. On the macse alignment step I noticed the error log had a number of messages like this:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at align.ProfileAligner.<init>(ProfileAligner.java:120) at align.CodingMSA.<init>(CodingMSA.java:64) at align.CodingMSA.buildAlignmentReliable(CodingMSA.java:633) at align.CodingMSA.run(CodingMSA.java:659) at utils.MacseMain.main(MacseMain.java:426)

I have tried various configurations for memory and processors (last run was 3 processors + 120 gb of memory)...

This is a set of 5 insect genomes...

@jeffersonfparil
Copy link
Owner

Yes, it's failing to align with MACSE; however I have had no problems aligning with 32 cores and ~160GB of RAM - possibly some of your sequences are very large. Your options are:

@jsoghigian
Copy link
Author

jsoghigian commented Nov 16, 2023

Great suggestions - I did what you said, and notably I updated the MACSE wrapper to have more memory. That got rid of the heap space error, but now I have this in my error log:

42422 sequence(s) with genetic code The_Standard_Code Build draft alignment using greedy strategy based on UPGMA tree Start building smaller alignments 4242 fasta file:OG0006748.aligned.unsorted.cds.tmp not found java.io.FileNotFoundException: OG0006748.aligned.unsorted.cds.tmp (No such file or directory) at java.base/java.io.FileInputStream.open0(Native Method) at java.base/java.io.FileInputStream.open(FileInputStream.java:216) at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157) at java.base/java.io.FileInputStream.<init>(FileInputStream.java:111) at java.base/java.io.FileReader.<init>(FileReader.java:60) at bioObject.CodingDnaSeq.readFasta(CodingDnaSeq.java:562) at utils.MacseMain.main(MacseMain.java:590) rm: cannot remove 'OG0006748*.tmp': No such file or directory rm: cannot remove 'OG0006748.AA.prot': No such file or directory

Is that still likely related to running out of memory?

@jeffersonfparil
Copy link
Owner

I think it has something to do with the pipeline reruns. The module single_gene_orthogroups_tree.nf cleans up the extracted sequences for alignment and other temporary files. Can you try rerunning the entire single_gene_orthogroups_tree.nf module again?

@jsoghigian
Copy link
Author

So I tried various things and kept running into the same error... I think the problem is my input file formats. The files from RefSeq have pipes which I think may be causing problems! I am going to see about modifying those input files and trying again.

@jeffersonfparil
Copy link
Owner

The pipes should not be a problem as I am accounting for those; but I may be completely wrong and some steps are failing because of them. Have you also tried completely restarting the pipeline from scratch (good 'ol turn it off and on again)?

@jsoghigian
Copy link
Author

I did try to restart the pipeline. I noticed A LOT of the files that are pulled out by single_gene_orthogroups_tree.nf are 80 mb with 40k sequences in them! We've run some of these genomes through orthofinder before and never had such large gene families found...

I can try manipulating headers tomorrow and seeing...

@jeffersonfparil
Copy link
Owner

I wonder if those are because of my sequence extractor scripts or OrthoFinder's orthogroup detection.

Thanks for your tenacity on this. Please let me know if it still fails again. Also, if you can share your config files so that I can recreate the errors and we'll debug further.

@jsoghigian
Copy link
Author

jsoghigian commented Nov 23, 2023

Okay! It was the headers. RefSeq headers for a couple of my species looked like this:

lcl|NC_004818.2_prot_XP_311158.5_1 [gene=KIBRLG] [locus_tag=AgaP_AGAP000002] [db_xref=VectorBase:AGAP000002-PA,GeneID:1272272] [protein=AGAP000002-PA] [protein_id=XP_311158.5] [location=complement(join(582..865,950..3120,3211..3370,3459..3760,15747..15871))] [gbkey=CDS]

I found an alternative repository with different headers, which looked like this:

AGAP000002-PA | transcript=AGAP000002-RA | gene=AGAP000002 | organism=Anopheles_gambiae_PEST | gene_product=unspecified product | transcript_product=unspecified product | location=AgamP4_X:582-15871(-) | protein_length=1013 | sequence_SO=chromosome | SO=protein_coding_gene | is_pseudo=false

After this, masce aligned the orthologs without issue

@jeffersonfparil
Copy link
Owner

Nice! Then I think it's probably an issue with my sequence extractor (Called here this julia script here). I'll look at them when I have time, but feel free to submit pull requests if you have time to fix them.

@jsoghigian
Copy link
Author

jsoghigian commented Nov 27, 2023

I can take a look! I almost wonder if a "translation table" would be a better optional approach for datasets like mine, where the headers are kind of a mess, haha.

Regardless I was able to get things to run quite well, and I'm excited to try it on a larger dataset!

Question for you - what kind of filtering do you do of inputs before you run the pipeline? I realized the gene sets I downloaded did not do any isoform curation which resulted in some hilarious levels of purported gene duplication (the gene sets from our genomes were pre-curated for isoforms, I just didn't think to do that to this new set when I was testing header issues).

@jeffersonfparil
Copy link
Owner

Awesome! We have not had problems with isoforms messing with our gene duplication estimates. BUSCO analyses seem to be within expected ranges for the plant species we've had to deal with so far. And I'm not too intimately knowledgeable about the annotation pipeline which may or may not deal with isoforms well enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants