`mlr count` using much more memory than expected #1028

jgarthur · 2022-05-27T20:02:25Z

Hi, thanks for developing this great tool!

I was working on an application of `mlr count -g var1,var2`, where var1 and var2 are both strings and the input file is quite large (51GB uncompressed). I noticed the memory usage in htop growing until it exceeded the uncompressed input file size. Could there be a memory leak here?

I've reproduced the issue with a minimal example containing only 1 column:

# 76 MB input file with one column, 1M unique string values with a count of 10 each
$ wc -l test.csv
10000001 test.csv
$ head -n 5 test.csv
a
A0
A1
A2
A3
$ tail -n 5 test.csv
A999995
A999996
A999997
A999998
A999999

# 2.5 GB max RSS
$ /usr/bin/time mlr --csv count -g a -o count test.csv > test_out_mlr
31.31user 9.92system 0:12.40elapsed 332%CPU (0avgtext+0avgdata 2548744maxresident)k
0inputs+21272outputs (0major+618198minor)pagefaults 0swaps

# 1.9 MB max RSS (as written, /usr/bin/time wraps only `cat`, so this figure is for cat, not gawk)
$ /usr/bin/time cat test.csv | tail -n +2 | gawk '{c[$1] += 1} END {for (x in c) {print x "," c[x]}}' > test_out_awk
0.00user 0.25system 0:03.38elapsed 7%CPU (0avgtext+0avgdata 1896maxresident)k
0inputs+0outputs (0major+93minor)pagefaults 0swaps

# same results modulo header and sort order
$ diff <(sort test_out_awk) <(sort test_out_mlr)
1000000a1000001
> a,count
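
For anyone wanting to reproduce this, an input file of the shape described above could be generated along these lines. This is only a sketch: `gen_test_csv` is a hypothetical helper name, and the row ordering (ten full passes over A0..A999999) is an assumption based on the `head` and `tail` output, not something stated in the report.

```shell
# Hypothetical helper (not from the original report): write a one-column CSV
# with header "a", then the values A0..A<n_unique-1>, repeated <repeats> times.
# The ordering (full passes over the sequence) is an assumption.
gen_test_csv() {
  n_unique=$1
  repeats=$2
  echo a
  i=0
  while [ "$i" -lt "$repeats" ]; do
    # One pass: 0..n_unique-1, each prefixed with "A".
    seq 0 $((n_unique - 1)) | sed 's/^/A/'
    i=$((i + 1))
  done
}
```

Running `gen_test_csv 1000000 10 > test.csv` gives a header plus 10,000,000 data rows, matching the 10000001 lines reported by `wc -l`.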

johnkerl · 2022-05-28T01:23:18Z

Hi @jgarthur !!

Thanks for submitting this! :)

I think the test.csv example may be due in part to a "baseline RSS" rather than a leak. There are three things I'm aware of: one is that Go executables are statically linked; another is that the entire Go runtime is present in that linkage; the third is that the (dense, not sparse) LR(1)-parser matrices take up quite a bit of memory. The first two are intrinsic to Go; the third is due to my use of GOCC. A "someday" project would be to try out GOGLL and see if that helps.

I think the question of leak-or-no-leak depends on the number of unique var1,var2 pairs: if there are only a few, this sounds very leaky; if there are many, it may be hash-map overhead from tracking the counts.

I will try

mlr --csv head -n 100 then count -g a -o count test.csv
mlr --csv head -n 1000 then count -g a -o count test.csv
mlr --csv head -n 10000 then count -g a -o count test.csv
mlr --csv head -n 100000 then count -g a -o count test.csv
mlr --csv head -n 1000000 then count -g a -o count test.csv
...

etc. to get a sense of what's baseline RSS and what's data-dependent.
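
The sweep above could be scripted roughly as follows. This is a sketch under stated assumptions: `parse_max_rss_kb` and `rss_sweep` are hypothetical helper names (not part of Miller), GNU time is assumed to be installed at `/usr/bin/time`, and its default report is assumed to contain `<N>maxresident)k` as in the output earlier in this thread.

```shell
# Hypothetical helpers (not part of Miller). Assumes GNU time at /usr/bin/time,
# whose default report contains "<N>maxresident)k".

# Read a GNU time report on stdin; print the max RSS in KB.
parse_max_rss_kb() {
  sed -n 's/.*[^0-9]\([0-9][0-9]*\)maxresident.*/\1/p'
}

# For each record limit given as an argument, print "<limit>,<max RSS KB>".
rss_sweep() {
  for n in "$@"; do
    # mlr's stdout is discarded; the time report arrives on stderr.
    rss=$( { /usr/bin/time mlr --csv head -n "$n" then count -g a -o count test.csv > /dev/null; } 2>&1 | parse_max_rss_kb )
    echo "$n,$rss"
  done
}
```

Calling `rss_sweep 100 1000 10000 100000 1000000` prints one line per limit. Whatever stays flat across the small limits would be the baseline; growth with the limit would be the data-dependent part.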

johnkerl · 2022-11-26T16:46:13Z

Related to #1119

johnkerl · 2023-03-06T05:26:26Z

I've done as much as I can on #1119; please re-open if this is still a blocking issue.

johnkerl added the active label May 30, 2022

johnkerl removed the active label Sep 6, 2022

johnkerl self-assigned this Nov 27, 2022

johnkerl added the active label Nov 27, 2022

johnkerl removed the active label Jan 1, 2023

johnkerl changed the title ~~count using much more memory than expected~~ mlr count using much more memory than expected Feb 28, 2023

johnkerl closed this as completed Mar 6, 2023
