Summary eats all my RAM even on moderate data (400 MB) #1119
Comments
@janxkoci the […]

Also for the moment […]
Thanks, this should help for basic use. I think it also makes sense to disable […]. But it's still curious that […]
@janxkoci Yes, addressed in #1131 -- due to a typo, the percentile-keeper data structure was being populated even for […]
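The bug pattern described here is worth illustrating. Miller is written in Go and the actual fix is in #1131; the following is only a hypothetical Python sketch of the general shape of the problem, not Miller's code: a percentile keeper must retain every value it sees, so populating it when no percentile statistics were requested costs O(n) memory for nothing.

```python
# Hypothetical sketch (NOT Miller's actual Go code) of the bug pattern:
# a percentile keeper retains every value, so filling it unconditionally
# makes memory grow with input size even when no percentile was requested.

class SummaryAccumulator:
    def __init__(self, want_percentiles):
        self.want_percentiles = want_percentiles
        self.count = 0
        self.total = 0.0
        self.kept_values = []  # the "percentile keeper": grows with the input

    def ingest(self, x):
        self.count += 1
        self.total += x
        # The fix: only retain values when a percentile stat (median, p25, ...)
        # was actually requested. The typo made this happen unconditionally.
        if self.want_percentiles:
            self.kept_values.append(x)

    def mean(self):
        return self.total / self.count if self.count else None

acc = SummaryAccumulator(want_percentiles=False)
for x in range(1_000_000):
    acc.ingest(float(x))
print(acc.mean())            # 499999.5
print(len(acc.kept_values))  # 0 -- constant memory when percentiles are off
```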
@janxkoci #1133 gets perhaps another 20% memory savings. I don't have low-hanging fruit left ... more things that can be handled with much deeper rework. I have some remorse about last year's port from C to Go ... there are some zero-sum tradeoffs between memory consumption and working around Go-runtime overhead ... also some things perhaps just need deeper thinking, regardless of language ...
I think this is great, thanks. And don't worry, I think the port was a good idea 😉 Miller works really well for most things I tried, the […]

PS: I can test the fixes whenever you make a new release on conda...
@janxkoci thanks! And, Miller 6.5.0 is now available on conda-forge: https://anaconda.org/conda-forge/miller/
Just updated and tested, and the […]
I wanted to play with the `summary` verb, but I keep running out of RAM on my laptop with 16 GB of RAM. This is surprising, since most of the metrics should need only a few oosvars, with the possible exceptions of `median` and maybe `distinct_count`. But even removing those accumulators still leads to all RAM being consumed and me killing the process.

Moreover, doing the equivalent operation with `stats1` has mostly negligible RAM consumption, although `median` indeed consumes a lot of RAM.

But let's see some data.
Data
For a start, my data looks as follows:
In short, the first two columns represent integer positions of genetic variants (chromosome ID and nucleotide position within the chromosome). The rest of the columns provide per-population allele frequency at each position - these are floats between 0 and 1.
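The original sample rows are not preserved in this copy. As a stand-in, here is a small sketch that fabricates rows in the shape just described; the column names and all values are invented for illustration only.

```python
# Synthetic stand-in for the data described above: two integer position
# columns (chromosome ID, nucleotide position) plus per-population
# allele frequencies in [0, 1]. All names and values are made up.
import random

def make_rows(n_rows, populations, seed=42):
    rng = random.Random(seed)
    header = ["chrom", "pos"] + populations
    rows = [header]
    pos = 0
    for _ in range(n_rows):
        pos += rng.randint(1, 10_000)          # positions strictly increasing
        freqs = [round(rng.random(), 4) for _ in populations]
        rows.append([1, pos] + freqs)
    return rows

for row in make_rows(3, ["pop1", "pop2", "pop3"]):
    print("\t".join(str(v) for v in row))
```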
Now, running this will eat all my RAM:
Same for this:
While this finishes fine with negligible RAM consumption:
Adding `median` to the above again leads to my RAM disappearing.

Other tools
The most surprising part for me is that I can easily load the full data into R and perform the same operations, with very reasonable RAM consumption:
Clearly, R has no problem handling data of this size - my system monitor tells me the R session takes some 1.59 GB of memory.
So why does Miller need so much more?