
another type of error #47

Open
ospfsg opened this issue Apr 11, 2024 · 9 comments

Comments

@ospfsg

ospfsg commented Apr 11, 2024

Hi

After editing the reference genome file, as you suggested, it worked for the individual sample files, but there is still a problem with the Pool-Seq data. It crashed and the terminal closed by itself. The problem seems to occur during the SNP calling.

This is my config part for freebayes

  # ----------------------------------------------------------------------
  #     freebayes
  # ----------------------------------------------------------------------

  # Used only if settings:calling-tool == freebayes
  # See https://github.com/freebayes/freebayes
  freebayes:

    # Extra parameters for freebayes.
    extra: "-p 40 --use-best-n-alleles 4 --pooled-discrete"

    # Settings for parallelization
    threads: 120
    compress-threads: 2
    chunksize: 100000
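For orientation, here is a rough sketch of the command line those config values would correspond to; the helper function and config dict are illustrative, not how grenepipe actually builds its rules:

```python
# Sketch: assemble a freebayes invocation from config values like the above.
# freebayes_command, the reference, and the bam name are hypothetical.

def freebayes_command(config, ref="genome.fa", bam="sample.bam", region=None):
    cmd = ["freebayes", "-f", ref]
    if region:
        cmd += ["-r", region]          # restrict calling to one region/chromosome
    cmd += config["extra"].split()     # the "extra" string is passed through verbatim
    cmd.append(bam)
    return cmd

cfg = {"extra": "-p 40 --use-best-n-alleles 4 --pooled-discrete"}
print(" ".join(freebayes_command(cfg, region="chromosome_1")))
```

The key point is that `extra` is forwarded as-is, so `-p 40` sets freebayes to a ploidy of 40 for every call.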

I attached the log file

2024-04-09T151509.566084.snakemake.log

and the two log files that seem to be relevant

log: logs/freebayes/JAVFWQ010000033.1.log

JAVFWQ010000033.1.log

log: logs/freebayes/chromosome_1.log

chromosome_1.log

thank you again
osp

@lczech
Member

lczech commented Apr 11, 2024

Hi @ospfsg,

interesting. What exactly do you mean by

It crashed and the terminal closed by itself.

Without any error message, the terminal just closes? That does not seem like anything that would happen from running snakemake or grenepipe, so there must be something stranger going on. Could you maybe give me some steps for how I could reproduce this behavior, or describe what is happening in more detail? Or even better, use a screen recording tool to get a video clip of what is happening?

From your log files, there is nothing that explains the behavior that you describe. I am seeing

Invalid tag name: "technology.-"

in there. Is that name something that shows up anywhere in your data, as a sample name or chromosome name, or somewhere else? Apart from that, there are no clues that would help right now.

I am traveling and won't be able to investigate this issue for another week. I'll then try to reproduce it with the freebayes settings that you posted above. Let's see if I can recreate the issue as well. If not, and in the meantime: How big is your data set? Could you maybe prepare a minimal subset of your data that causes the issue and that I can use to investigate? It seems like a rather involved and complicated issue if the terminal just closes. I have never seen that, and tools such as freebayes and snakemake (and grenepipe) should usually not do that - hence, I don't really have a good idea of what is going on there.

So long, let's see what we can figure out
Lucas

@ospfsg
Author

ospfsg commented Apr 12, 2024

Hi

I just ran 1/4 of the Pool-Seq dataset and everything went smoothly in less than 24 hours, generating 65 million SNPs!

So maybe it is my workstation (128 threads and 512 GB RAM) that cannot cope with the full dataset. I am going to try with increasing amounts of data to check how far I can go.

osp

@lczech
Member

lczech commented May 23, 2024

Hi @ospfsg,

how are things going? Did the pipeline work now with the reduced data set? From what you described, I can only guess what happened - but if it works with smaller data and fails with larger data, then it is rather likely that this is an issue with memory or your workstation, rather than something related to grenepipe itself.

Let me know what you think, and if we can close this issue then.

Cheers and so long
Lucas

@ospfsg
Author

ospfsg commented May 28, 2024

Hi Lucas

The problem seems not to be the amount of data but the ploidy used in freebayes. I am currently running with a ploidy of 4, even though the pool contains 40 samples. When I took the bam files that come out of the pipeline (dedup) and ran them in freebayes directly, I got the same problem.
So the problem is not in your pipeline; the problem is the capacity of my workstation dealing with freebayes -p 40. It copes up to a ploidy of 30, but apparently not more than that!!!
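The ploidy sensitivity is plausible from combinatorics alone: for ploidy p and a candidate alleles, the number of unordered genotypes is the multiset coefficient C(p+a-1, a-1), which a discrete pooled model has to consider per site. This is only one ingredient of freebayes' actual memory cost, but it illustrates the superlinear growth; a quick sketch (plain Python, not freebayes code):

```python
from math import comb

def n_genotypes(ploidy, n_alleles):
    # Number of unordered genotypes (multisets of alleles):
    # C(ploidy + n_alleles - 1, n_alleles - 1)
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

# With --use-best-n-alleles 4, the genotype space per site grows like this:
for p in (2, 4, 30, 40):
    print(p, n_genotypes(p, 4))
```

So going from the default diploid setting to -p 40 multiplies the per-site genotype space by more than a thousandfold, before any haplotype enumeration is even considered.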
Does your pipeline natively implement parallelization in freebayes?

Thank you
osp

@lczech
Member

lczech commented May 28, 2024

Hi @ospfsg,

ah okay. So it seems you want to use freebayes for variant calling on pooled data? In that case, I'd recommend having a look at our preprint, where we show that this can lead to issues and biases down the line. That's because the typical variant callers (all three that grenepipe offers) are not meant for pooled data, and hence their statistical assumptions for detecting what is a variant (in individuals) are not applicable to pools. That's also why these tools are not optimized for the artificially high ploidy that you are using there.

Nonetheless, this type of variant calling on pooled data seems to be common practice. I'd however advise against it. Instead, what we found has been working well, and is also way faster, is to just map the reads to the reference genome you are using. This can for example be done by only running the mapping part of the pipeline, see here.
Then, you can work on the raw counts of how many nucleotides were mapped to each position (often called allele counts), which acts as a proxy for allele frequencies, and can be used for downstream analyses instead of called variants. This also offers a more sophisticated view on the data, where instead of a hard yes/no variant call, you can operate on frequencies, and hence capture more detail.
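To illustrate the counts-as-frequencies idea with a toy example (plain Python with made-up numbers, not grenedalf's implementation):

```python
def allele_frequencies(counts):
    # counts: mapping base -> number of reads covering one genomic position
    # The read fraction per base serves as a proxy for the allele frequency
    # in the pool, instead of a hard yes/no variant call.
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()} if total else {}

# e.g. 20 reads at one position of a pooled sample:
pos_counts = {"A": 18, "C": 2}
print(allele_frequencies(pos_counts))
```

Here the minor allele C would be treated as having frequency 0.1 in the pool, which downstream statistics can use directly.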

We are also currently in the very last steps of publishing our tool for working with this kind of data, called grenedalf. If you want to run some typical downstream analysis, this might help. If grenedalf is lacking some functionality that you need, let me know. However, wait a few more days before using it - by the end of the week (hopefully) I'll have a new version out, in which many things are improved and corrected compared to the current v0.3.0.

As for your question:

Does your pipeline natively implement parallelization in freebayes?

Variant calling is parallelized over chromosomes. By its very nature, however, it has to be run on all samples at the same time, so there is unfortunately not much more that can be parallelized easily.
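Conceptually, per-chromosome parallelization amounts to something like the following sketch; call_variants is a hypothetical stand-in for launching the caller on one chromosome (e.g. freebayes -r <chromosome>), not grenepipe's actual rule code:

```python
from concurrent.futures import ThreadPoolExecutor

def call_variants(chromosome):
    # Stand-in for launching the variant caller on one chromosome as an
    # external process; here we just return a label for illustration.
    return f"called:{chromosome}"

# Chromosome names taken from the log files mentioned in this thread:
chromosomes = ["chromosome_1", "JAVFWQ010000033.1"]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(call_variants, chromosomes))
print(results)
```

Each chromosome job still sees every sample's reads for its region, which is why parallelizing over samples is not an option.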

Hope that helps, and so long
Lucas

@ospfsg
Author

ospfsg commented May 29, 2024

Hi Lucas

Thank you very much for your advice. I will be happy to test grenedalf.

In your preprint you have an interesting paragraph:

These analyses resulted in prohibitively long runtimes even in cluster environments (GATK HaplotypeCaller) and large memory usage (freebayes), demonstrating these tools’ limited capabilities for analyzing large datasets and large pool sizes.
We hence ran the three callers with default ploidy of 2 to study their artifacts in Pool-Seq applications, assuming that other researchers may be required to resort to these default settings.

I assume that the second sentence is about your experiment 2, but did you test the impact of a "fake" low ploidy on the results?

I used the .bam files after Picard, the ones in the dedup folder, which seem to me to be the input for the SNP callers. Can I use them as input for grenedalf?

Thank you
osp

@lczech
Member

lczech commented May 29, 2024

Hi Octavio,

I assume that the second sentence is about your experiment 2, but did you test the impact of a "fake" low ploidy on the results?

Yes, if I recall correctly, that's what we did. See supplementary figures S6 onward in the manuscript :-)

I used the .bam files after Picard, the ones in the dedup folder, which seem to me to be the input for the SNP callers. Can I use them as input for grenedalf?

Yes you can! It seemed rather wasteful and annoying to me to have to convert everything to pileup files first, so in grenedalf, I implemented reading directly from sam/bam files ;-)

Cheers
Lucas

@ospfsg
Author

ospfsg commented Jun 6, 2024

Hi Lucas

is it possible to run grenepipe starting from the bam files (from dedup) and run the different SNP callers, instead of starting the whole process from the beginning each time we want to use a different caller?

best
osp

@lczech
Member

lczech commented Jun 8, 2024

Hi @ospfsg,

good question, and you are not the only one with a request like that. Unfortunately, Snakemake itself is not meant for situations where changes from the outside are needed in the middle of a workflow run. And within grenepipe, I've had users who wanted similar things ("start from there", "re-use these files", etc.), but always in a slightly different way, which means I cannot easily integrate that as a feature: it would need to be so flexible to accommodate all these use cases that it would be both hard to implement and hard to use...

So, long story short, the current solution for use cases like that is to trick Snakemake itself into doing what you want. Depending on how you want to do this, it can be a bit tricky, but it is possible.

The straightforward approach is as follows: Run the pipeline with one variant caller. Then move the directories that contain all variant-caller-related files somewhere else; these directories are: called, filtered, and genotyped. Then change the config.yaml to the next variant caller you want to try, and run the pipeline again. Snakemake will recognize that all the mapping files are there, but that the calling files are not, and will hence only run those parts. This approach is the easiest one, and I'd recommend doing that.

A couple of tips for this approach: Before running Snakemake again for the other callers, I'd first do a dry run with the extra flags -nq to check that Snakemake actually does not want to re-run the trimming and mapping. Also, I'd copy the config.yaml file of each run to the same directory where you move the called, filtered, and genotyped directories, so that the config file that created those variant calls is preserved as well.

The only downside is that you can only run the pipeline with one variant caller (or one config setting in general) at a time. For the three callers implemented, I think this is good enough. If you'd need to scale this up, to test many settings of each caller, etc, a more scalable approach would be needed, where you can run everything at once, and don't need to manually copy directories each time. If you need that instead, let me know.
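The steps above could be scripted roughly like this; the directory names come from the comment, while the archive location and function name are assumptions:

```python
import shutil
from pathlib import Path

# Variant-caller output directories named in the comment above:
CALLER_DIRS = ["called", "filtered", "genotyped"]

def archive_caller_run(outdir, caller_name):
    # Move the variant-caller outputs aside so that Snakemake re-runs only
    # the calling steps, and preserve the config.yaml that produced them.
    outdir = Path(outdir)
    archive = outdir / f"run-{caller_name}"   # destination path is an assumption
    archive.mkdir(parents=True, exist_ok=True)
    for d in CALLER_DIRS:
        src = outdir / d
        if src.exists():
            shutil.move(str(src), str(archive / d))
    cfg = outdir / "config.yaml"
    if cfg.exists():
        shutil.copy(cfg, archive / "config.yaml")
    return archive
```

After moving, one would edit config.yaml to the next caller and do the dry run with -nq to confirm that only the calling rules would execute.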

Hope that helps, and so long
Lucas
