-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
another type of error #47
Comments
Hi @ospfsg, interesting. What exactly do you mean by
Without any error message, the terminal just closes? That does not seem like anything that would happen from running snakemake or grenepipe, so there must be something more weird going on. Could you somehow maybe give me some steps for how I could reproduce this behavior, or describe what is happening in more detail? Or even better, use a screen recording tool to get a video clip of what is happening? From your log files, there is nothing that explains the behavior that you describe. I am seeing
in there. Is that name something that shows up anywhere in your data, as a sample name or chromosome name, or somewhere else? Apart from that, there are no clues that would help right now. I am traveling and won't be able to investigate this issue for another week. I'll then try to reproduce with the freebayes settings that you posted above. Let's see if I can recreate the issue as well. If not, and in the meantime: How big is your data set? Could you maybe prepare a minimal subset of your data that causes the issue, and that I can use to investigate? It seems like a rather involved and complicated issue, if the terminal just closes. I have never seen that, and tools such as freebayes, snakemake (and grenepipe) should usually not do that - hence, I have not really a good idea what is going on there. So long, let's see what we can figure out |
Hi I just run 1/4 of the poolseq dataset and everything went smoothly in less than 24 hours, generating 65Million SNPs! So maybe is my workstation (128 threads and 512GB RAM) that doesn't cope with the full dataset. I am going to try with increase amount to check how far I can go. osp |
Hi @ospfsg, how are things going? Did the pipeline work now with a reduced data set? From what you described, I can only guess what has happened - but if it works with smaller data, and fails with larger, then it is rather likely that this is some issue with memory or your workstation, instead of something related to grenepipe itself. Let me know what you think, and if we can close this issue then. Cheers and so long |
Hi Lucas The problem seems not to be the amount of data but the ploidy that is used in freebayes. I am currently running with ploidy of 4 in spite that in the pool I have 40 samples. So the problem is not in your pipeline, the problem is the capacity of my workstation dealing with a freebayes -p 40.. It copes until ploidy of 30 but apparently not more than that!!! Thank you |
Hi @ospfsg, ah okay. So, it seems you want to use freebayes for variant calling on pooled data? In that case, I'd recommend to have a look at our preprint where we show that this can lead to issues and biases down the line. That's because the typical variant callers (all three that grenepipe offers) are not meant for pooled data - and hence their statistical assumptions for detecting what is a variant (in individuals) not applicable for pools. That's also why these tools are not optimized for the artificially high ploidy that you are using there. Nonetheless, this type of variant calling on pooled data seems to be common practice. I'd however advise against it. Instead, what we found has been working well, and is also way faster, is to just map the reads to the reference genome you are using. This can for example be done by only running the mapping part of the pipeline, see here. We are also currently in the very last steps of publishing our tool for working with this kind of data, called grenedalf. If you want to run some typical downstream analysis, this might help. If grenedalf is lacking some functionality that you need, let me know. However, wait a few more days before using it - by the end of the week (hopefully) I'll have a new version out, in which many things are improved and corrected compared to the current v0.3.0. As for your question:
Variant calling is parallelized over chromosomes. It however by its very nature has to be run on all samples at the same time, so there is unfortunately not much more than can be parallelized easily. Hope that helps, and so long |
Hi Lucas Thank very much for your advice. I will be happy to test grenedalf. In your prepprint you have an interesting paragraph : These analyses resulted in prohibitively long runtimes even in cluster environments (GATK HaplotypeCaller) and large memory usage (freebayes), demonstrating these tools’ limited capabilities for analyzing large datasets and large pool sizes. I assume that the second sentence is about your experiment 2, but did you test the impact on the results of a "fake" low ploidy on the results ? I used the .bam files after picard the ones in the dedup folder, that seems to me be the input for the snpcallers. Can I used then for the input grenedalf? Thank you |
Hi Octavio,
Yes, if I recall correctly, that's what we did. See supplementary figures S6 onward in the manuscript :-)
Yes you can! It seemed rather wasteful and annoying to me to have to convert everything to pileup files first, so in grenedalf, I implemented to read directly from sam/bam files ;-) Cheers |
Hi Lucas is it possible to run grenepipe with the starting point of bam files (from dedup) and run the different snpcallers instead of starting the all process from the beginning each time we want to use a different caller ? best |
Hi @ospfsg, good question, and you are not the only one with a request like that. Unfortunately, Snakemake itself is not meant for these things where changes from the outside are needed in the middle of a workflow run. And within grenepipe, I've had users who wanted similar things of "start from there", "re-use these files", etc, but always in a slightly different way, which means, I cannot easily integrate that as a feature, because it would need to be so flexible to accommodate for all these use cases that it would be both hard to implement and hard to use... So, long story short, the current solution for use cases like that is to trick Snakemake itself into doing what you want. Depending on how you wanna do this, this can be a bit tricky, but possible. The straight forward approach is as follows: Run the pipeline with a variant caller. Then, move the directories that contain all variant-caller-related files to somewhere else. These directories are: Couple of tips for this approach: Before running Snakemake again for the other callers, I'd check first with the extra flags The only downside is that you can only run the pipeline with one variant caller (or one config setting in general) at a time. For the three callers implemented, I think this is good enough. If you'd need to scale this up, to test many settings of each caller, etc, a more scalable approach would be needed, where you can run everything at once, and don't need to manually copy directories each time. If you need that instead, let me know. Hope that helps, and so long |
Hi
After editing the reference genome file, as you suggested, it worked for individual sample files but there is still a problem with PoolSeq data . It crashed and the terminal closed by himself. The problem seems to be during the snpcalling
This is my config part for freebayes
I attached the log file
2024-04-09T151509.566084.snakemake.log
and the two log files that seems to be relevant
log: logs/freebayes/JAVFWQ010000033.1.log
JAVFWQ010000033.1.log
log: logs/freebayes/chromosome_1.log
chromosome_1.log
thank you again
osp
The text was updated successfully, but these errors were encountered: