
question about memory requirements and determining appropriate kmer size #13

Open
eocampbe opened this issue Mar 25, 2020 · 1 comment

eocampbe commented Mar 25, 2020

Hi there,

I have completed an initial analysis using DiscoverY in the female+male mode, and I am wondering how I might determine whether the kmer size I used is optimal for my data. For the analysis I've done so far I used the default size of 25, but I understand this may need to be adjusted based on the specific characteristics of the genome I'm working with.

I have plotted the results of my analysis (attached), and there seem to be a large number of kmers with very low similarity to the female genome (which is of quite good quality) but high depth. The organism we're working on has a neo sex chromosome system, so I suspect the Y regions are clustering in with the X regions in the bottom right of the graph (confirming this was actually my reason for using DiscoverY). However, I'm less sure why there are so many male contigs with very low similarity to the female but rather high coverage. I don't know if this is a result of my kmer parameter or something else, and I'm hoping you might be able to offer some advice.

In addition, this analysis required about 720 GB of RAM, which is about double the estimate in the paper and is nearly the maximum amount of RAM I'm allowed to request per node on the cluster I'm using. Can DiscoverY run in parallel so that I can spread this memory across multiple nodes? I don't see anything in the documentation or the paper that mentions this, but it would be very helpful for subsequent analyses.

Thanks,
Erin
graph.pdf

@sheinasim

Hello!

I am not part of the group that wrote this software, but I was able to calculate an appropriate kmer size by applying the random boundary formula (Eq. 2 in Fofanov et al. 2004):

kmer_size = log[ genome_size * (1 - error_rate_of_reads) / error_rate_of_reads ] / log(4)
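
For example, a minimal sketch of that calculation in Python; the genome size (1 Gbp) and per-base read error rate (1%) below are hypothetical placeholders, so substitute your own values:

```python
import math

def recommended_kmer_size(genome_size, error_rate):
    # Random boundary formula (Eq. 2 in Fofanov et al. 2004):
    # k = log4( genome_size * (1 - error_rate) / error_rate )
    return math.ceil(math.log(genome_size * (1 - error_rate) / error_rate, 4))

# Hypothetical inputs: 1 Gbp genome, 1% per-base read error rate.
print(recommended_kmer_size(1e9, 0.01))  # -> 19 for these inputs
```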

Hope this helps!

@RAWWiberg mentioned this issue Jan 31, 2022