
Summary eats all my RAM even on moderate data (400 MB) #1119

Closed
janxkoci opened this issue Nov 1, 2022 · 9 comments

@janxkoci

janxkoci commented Nov 1, 2022

I wanted to play with the summary verb, but I keep exhausting the 16 GB of RAM on my laptop. This is surprising, since most of the metrics should need only a few oosvars, with the possible exceptions of median and maybe distinct_count. But even removing those accumulators still leads to all RAM being consumed, forcing me to kill the process.

Moreover, doing the equivalent operation with stats1 has mostly negligible RAM consumption, although median indeed consumes a lot.
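To illustrate the distinction (this is a minimal Go sketch of the general technique, not Miller's actual code): min, mean, and max can be computed in a single streaming pass with a fixed handful of state variables, while an exact median must retain every value seen.

```go
package main

import (
	"fmt"
	"sort"
)

// streamingStats is an O(1)-memory accumulator: min, mean, and max need
// only a fixed handful of state variables, however many rows flow by.
type streamingStats struct {
	n        int
	sum      float64
	min, max float64
}

func (s *streamingStats) update(x float64) {
	if s.n == 0 || x < s.min {
		s.min = x
	}
	if s.n == 0 || x > s.max {
		s.max = x
	}
	s.sum += x
	s.n++
}

func (s *streamingStats) mean() float64 { return s.sum / float64(s.n) }

// An exact median, by contrast, must retain all n values before sorting.
func median(xs []float64) float64 {
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	m := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[m]
	}
	return (sorted[m-1] + sorted[m]) / 2
}

func main() {
	// The first three African values from the sample data below.
	data := []float64{0.19767442, 0.97727273, 0.011627907}
	var s streamingStats
	for _, x := range data {
		s.update(x)
	}
	fmt.Println(s.min, s.max, s.mean())
	fmt.Println(median(data))
}
```

So with only O(1) accumulators requested, memory use should be flat regardless of row count; retaining all n values is only unavoidable for percentile-style statistics.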

But let's see some data.

Data

To start, my data look as follows:

$ ls -lh archaics_african_sardinian_papuan.tsv
-rw-rw-r-- 1 jena jena 390M kvě 19 17:38 archaics_african_sardinian_papuan.tsv

$ mlr --t2m --from archaics_african_sardinian_papuan.tsv count
count
9854587
$ mlr --t2m --from archaics_african_sardinian_papuan.tsv head -n 3
chr pos African Sardinian Papuan Altai Vindija Chagyrskaya Denisova UstIshim
1 752721 0.19767442 0.94444444 0.1875 1 1 1 1 0.5
1 754163 0.97727273 0.94642857 0.94117647 1 1 1 1 1
1 773764 0.011627907 0 0 0 0 0 0 0

In short, the first two columns represent integer positions of genetic variants (chromosome ID and nucleotide position within the chromosome). The rest of the columns provide per-population allele frequency at each position - these are floats between 0 and 1.

Now, running this will eat all my RAM:

mlr --t2x --from archaics_african_sardinian_papuan.tsv summary

Same for this:

mlr --t2x --from archaics_african_sardinian_papuan.tsv summary -a mean,min,max

While this finishes fine with negligible RAM consumption:

mlr --t2x --from archaics_african_sardinian_papuan.tsv stats1 -a mean,min,max -f chr,pos,African,Sardinian,Papuan,Altai,Vindija,Chagyrskaya,Denisova,UstIshim

Adding median to the above again exhausts my RAM.

Other tools

The most surprising part for me is that I can easily load the full data into R and perform the same operations, with very reasonable RAM consumption:

$ radian
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Platform: x86_64-conda-linux-gnu (64-bit)
> daf = read.table("archaics_african_sardinian_papuan.tsv", header = T, sep = "\t")
> summary(daf)
      chr              pos               African          Sardinian          Papuan           Altai           Vindija        Chagyrskaya    
 Min.   : 1.000   Min.   :    12311   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 4.000   1st Qu.: 34043942   1st Qu.:0.01136   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median : 7.000   Median : 72563811   Median :0.02273   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   : 8.483   Mean   : 80570004   Mean   :0.13844   Mean   :0.1388   Mean   :0.1386   Mean   :0.1105   Mean   :0.1135   Mean   :0.1125  
 3rd Qu.:12.000   3rd Qu.:116845932   3rd Qu.:0.13636   3rd Qu.:0.1071   3rd Qu.:0.0625   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :22.000   Max.   :249219576   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    Denisova         UstIshim     
 Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000  
 Mean   :0.1114   Mean   :0.1357  
 3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000
> object.size(daf)
709532208 bytes
> print(object.size(daf), units = "Mb")
676.7 Mb

Clearly, R has no problem handling data of this size - my system monitor shows the R session taking some 1.59 GB of memory.

So why does Miller need so much more?

@johnkerl
Owner

johnkerl commented Nov 26, 2022

@janxkoci the Mlrval type in Miller 6 is a large polymorphic type -- Go has no unions as in C, so it ends up being 88 bytes in total. While the min, mean, and max accumulators store only O(1) data, median stores all n data points -- as an array of Mlrvals -- and as you've seen this adds up. Some redesign work is needed here.

@johnkerl johnkerl changed the title summary eats all my RAM even on moderate data (400 MB) Summary eats all my RAM even on moderate data (400 MB) Nov 26, 2022
@johnkerl
Owner

Also, for the moment, median should be omitted from the default-summarizers list for the summary verb -- it will still be available as opt-in, just not in place by default.

@janxkoci
Author

Thanks, this should help for basic use. I think it also makes sense to disable distinct_count for float-only fields - that's a bad idea in 100.000000000001% of situations.

But it's still curious that summary -a mean,min,max runs out of memory, while stats1 -a mean,min,max does not. I think somewhere there may be another problem with summary... 🤔

@johnkerl
Owner

> But it's still curious that summary -a mean,min,max runs out of memory, while stats1 -a mean,min,max does not. I think somewhere there may be another problem with summary... 🤔

@janxkoci Yes, addressed in #1131 -- due to a typo, the percentile-keeper data structure was being populated even for min and max, but was not being used for any outputs. As of the latest head, that's no longer the case.
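The shape of that fix can be sketched like this (a hypothetical illustration, not Miller's actual code; the accumulator names are just examples): the full column should be retained only when some requested accumulator actually needs it.

```go
package main

import "fmt"

// needsAllValues reports whether any requested accumulator is
// percentile-style and therefore requires retaining all n values.
// Before the fix, the retained-values structure was populated
// unconditionally; afterwards, streaming-only requests skip it.
func needsAllValues(accumulators []string) bool {
	for _, a := range accumulators {
		switch a {
		case "median", "p25", "p75": // illustrative percentile-style names
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsAllValues([]string{"mean", "min", "max"})) // false: O(1) streaming suffices
	fmt.Println(needsAllValues([]string{"mean", "median"}))     // true: must retain n values
}
```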

@johnkerl
Owner

johnkerl commented Nov 26, 2022

@janxkoci also: #1132 gets about a factor of two size reduction for when we are computing percentiles/median (or any other retain-entire-column/retain-entire-record cases)

@johnkerl
Owner

@janxkoci #1133 gets perhaps another 20% memory savings

I don't have any low-hanging fruit left ... what remains would need much deeper rework.

I have some remorse about last year's port from C to Go ... there are some zero-sum tradeoffs between memory consumption and working around Go-runtime overhead ... also some things perhaps just need deeper thinking, regardless of language ...

@janxkoci
Author

janxkoci commented Nov 30, 2022

I think this is great, thanks. And don't worry, I think the port was a good idea 😉 Miller works really well for most things I tried, the summary verb just had a few bugs, that's all ☺️

PS: I can test the fixes whenever you make a new release on conda...

@johnkerl
Owner

@janxkoci thanks! And, Miller 6.5.0 is now available on conda-forge: https://anaconda.org/conda-forge/miller/

@janxkoci
Author

janxkoci commented Nov 30, 2022

Just updated and tested, and summary now finishes fine; my peak RAM usage barely reached 50% (starting from ~30% used by my other processes).
