
Summary eats all my RAM even on moderate data (400 MB) #1119

Closed
janxkoci opened this issue Nov 1, 2022 · 9 comments

@janxkoci

janxkoci commented Nov 1, 2022

I wanted to play with the summary verb, but I keep exhausting the 16 GB of RAM on my laptop. This is surprising, since most of the metrics should need only a few oosvars, with the possible exceptions of median and maybe distinct_count. But even removing those accumulators still leads to all RAM being consumed, forcing me to kill the process.

Moreover, doing the equivalent operation with stats1 has mostly negligible RAM consumption, although median indeed consumes a lot.
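To illustrate the distinction (this is a minimal Go sketch of the general technique, not Miller's actual code): min, mean, and max can be computed in a single streaming pass with a fixed handful of state variables, while an exact median must retain every value seen.

```go
package main

import (
	"fmt"
	"sort"
)

// streamingStats is an O(1)-memory accumulator: min, mean, and max need
// only a fixed handful of state variables, however many rows flow by.
type streamingStats struct {
	n        int
	sum      float64
	min, max float64
}

func (s *streamingStats) update(x float64) {
	if s.n == 0 || x < s.min {
		s.min = x
	}
	if s.n == 0 || x > s.max {
		s.max = x
	}
	s.sum += x
	s.n++
}

func (s *streamingStats) mean() float64 { return s.sum / float64(s.n) }

// An exact median, by contrast, must retain all n values before sorting.
func median(xs []float64) float64 {
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	m := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[m]
	}
	return (sorted[m-1] + sorted[m]) / 2
}

func main() {
	// The first three African values from the sample data below.
	data := []float64{0.19767442, 0.97727273, 0.011627907}
	var s streamingStats
	for _, x := range data {
		s.update(x)
	}
	fmt.Println(s.min, s.max, s.mean())
	fmt.Println(median(data))
}
```

So with only O(1) accumulators requested, memory use should be flat regardless of row count; retaining all n values is only unavoidable for percentile-style statistics.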

But let's see some data.

Data

To start, my data look as follows:

$ ls -lh archaics_african_sardinian_papuan.tsv
-rw-rw-r-- 1 jena jena 390M kvě 19 17:38 archaics_african_sardinian_papuan.tsv

$ mlr --t2m --from archaics_african_sardinian_papuan.tsv count
count
9854587
$ mlr --t2m --from archaics_african_sardinian_papuan.tsv head -n 3
chr pos African Sardinian Papuan Altai Vindija Chagyrskaya Denisova UstIshim
1 752721 0.19767442 0.94444444 0.1875 1 1 1 1 0.5
1 754163 0.97727273 0.94642857 0.94117647 1 1 1 1 1
1 773764 0.011627907 0 0 0 0 0 0 0

In short, the first two columns represent integer positions of genetic variants (chromosome ID and nucleotide position within the chromosome). The rest of the columns provide per-population allele frequency at each position - these are floats between 0 and 1.

Now, running this will eat all my RAM:

mlr --t2x --from archaics_african_sardinian_papuan.tsv summary

Same for this:

mlr --t2x --from archaics_african_sardinian_papuan.tsv summary -a mean,min,max

While this finishes fine with negligible RAM consumption:

mlr --t2x --from archaics_african_sardinian_papuan.tsv stats1 -a mean,min,max -f chr,pos,African,Sardinian,Papuan,Altai,Vindija,Chagyrskaya,Denisova,UstIshim

Adding median to the above again exhausts my RAM.

Other tools

The most surprising part for me is that I can easily load the full data into R and perform the same operations, with very reasonable RAM consumption:

$ radian
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Platform: x86_64-conda-linux-gnu (64-bit)
> daf = read.table("archaics_african_sardinian_papuan.tsv", header = T, sep = "\t")
> summary(daf)
      chr              pos               African          Sardinian          Papuan           Altai           Vindija        Chagyrskaya    
 Min.   : 1.000   Min.   :    12311   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 4.000   1st Qu.: 34043942   1st Qu.:0.01136   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median : 7.000   Median : 72563811   Median :0.02273   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   : 8.483   Mean   : 80570004   Mean   :0.13844   Mean   :0.1388   Mean   :0.1386   Mean   :0.1105   Mean   :0.1135   Mean   :0.1125  
 3rd Qu.:12.000   3rd Qu.:116845932   3rd Qu.:0.13636   3rd Qu.:0.1071   3rd Qu.:0.0625   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :22.000   Max.   :249219576   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    Denisova         UstIshim     
 Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000  
 Mean   :0.1114   Mean   :0.1357  
 3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000
> object.size(daf)
709532208 bytes
> print(object.size(daf), units = "Mb")
676.7 Mb

Clearly, R has no problem handling data of this size - my system monitor shows the R session taking some 1.59 GB of memory.

So why does Miller need so much more?

@johnkerl
Owner

johnkerl commented Nov 26, 2022

@janxkoci the Mlrval type in Miller 6 is a large polymorphic type -- Go has no unions as in C, so it ends up being 88 bytes in total. While the min, mean, and max accumulators store only O(1) data, median stores all n data points -- as an array of Mlrvals -- and as you've seen this adds up. Some redesign work is needed here.

@johnkerl johnkerl changed the title summary eats all my RAM even on moderate data (400 MB) Summary eats all my RAM even on moderate data (400 MB) Nov 26, 2022
@johnkerl
Owner

Also, for the moment, median should be omitted from the default-summarizers list for the summary verb -- it will still be available as opt-in, just not in place by default.

@janxkoci
Author

Thanks, this should help for basic use. I think it also makes sense to disable distinct_count for float-only fields - that's a bad idea in 100.000000000001% of situations.

But it's still curious that summary -a mean,min,max runs out of memory, while stats1 -a mean,min,max does not. I think somewhere there may be another problem with summary... 🤔

@johnkerl
Owner

> But it's still curious that summary -a mean,min,max runs out of memory, while stats1 -a mean,min,max does not. I think somewhere there may be another problem with summary... 🤔

@janxkoci Yes, addressed in #1131 -- due to a typo, the percentile-keeper data structure was being populated even for min and max, but was not being used for any outputs. As of the latest head, that's no longer the case.
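The shape of that fix can be sketched like this (a hypothetical illustration, not Miller's actual code; the accumulator names are just examples): the full column should be retained only when some requested accumulator actually needs it.

```go
package main

import "fmt"

// needsAllValues reports whether any requested accumulator is
// percentile-style and therefore requires retaining all n values.
// Before the fix, the retained-values structure was populated
// unconditionally; afterwards, streaming-only requests skip it.
func needsAllValues(accumulators []string) bool {
	for _, a := range accumulators {
		switch a {
		case "median", "p25", "p75": // illustrative percentile-style names
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsAllValues([]string{"mean", "min", "max"})) // false: O(1) streaming suffices
	fmt.Println(needsAllValues([]string{"mean", "median"}))     // true: must retain n values
}
```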

@johnkerl
Owner

johnkerl commented Nov 26, 2022

@janxkoci also: #1132 gets about a factor of two size reduction for when we are computing percentiles/median (or any other retain-entire-column/retain-entire-record cases)

@johnkerl
Owner

@janxkoci #1133 gets perhaps another 20% memory savings

I don't have any low-hanging fruit left ... what remains would need much deeper rework.

I have some remorse about last year's port from C to Go ... there are some zero-sum tradeoffs between memory consumption and working around Go-runtime overhead ... also some things perhaps just need deeper thinking, regardless of language ...

@janxkoci
Author

janxkoci commented Nov 30, 2022

I think this is great, thanks. And don't worry, I think the port was a good idea 😉 Miller works really well for most things I tried, the summary verb just had a few bugs, that's all ☺️

PS: I can test the fixes whenever you make a new release on conda...

@johnkerl
Owner

@janxkoci thanks! And, Miller 6.5.0 is now available on conda-forge: https://anaconda.org/conda-forge/miller/

@janxkoci
Author

janxkoci commented Nov 30, 2022

Just updated and tested, and summary now finishes fine; my peak RAM usage barely reached 50% (starting from ~30% used by my other processes).
