mlr count using much more memory than expected #1028

Closed
jgarthur opened this issue May 27, 2022 · 3 comments

jgarthur commented May 27, 2022

Hi, thanks for developing this great tool!

I was working on an application of mlr count -g var1,var2, where var1 and var2 are both strings and the input file is quite large (51GB uncompressed). I noticed the memory usage in htop growing until it exceeded the uncompressed input file size. Could there be a memory leak here?

I've reproduced the issue with a minimal example containing only 1 column:

# 76 MB input file with one column, 1M unique string values with a count of 10 each
$ wc -l test.csv
10000001 test.csv
$ head -n 5 test.csv
a
A0
A1
A2
A3
$ tail -n 5 test.csv
A999995
A999996
A999997
A999998
A999999

# 2.5 GB max RSS
$ /usr/bin/time mlr --csv count -g a -o count test.csv > test_out_mlr
31.31user 9.92system 0:12.40elapsed 332%CPU (0avgtext+0avgdata 2548744maxresident)k
0inputs+21272outputs (0major+618198minor)pagefaults 0swaps

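# 1.9 MB max RSS (as written, /usr/bin/time wraps only cat, the first stage of the pipeline, so this is cat's RSS, not gawk's)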
$ /usr/bin/time cat test.csv | tail -n +2 | gawk '{c[$1] += 1} END {for (x in c) {print x "," c[x]}}' > test_out_awk
0.00user 0.25system 0:03.38elapsed 7%CPU (0avgtext+0avgdata 1896maxresident)k
0inputs+0outputs (0major+93minor)pagefaults 0swaps

# same results modulo header and sort order
$ diff <(sort test_out_awk) <(sort test_out_mlr)
1000000a1000001
> a,count
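
For anyone who wants to rebuild the input, here is a minimal sketch of a generator that matches the `head`, `tail`, and `wc -l` output shown above. The function name `gen_test_csv` and its arguments are hypothetical, and the exact way the original file was produced is not stated in the report, so treat this as one plausible reconstruction.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): write a one-column CSV
# with header "a" and the values A0 .. A(n-1), repeating the whole sequence
# `repeats` times.
gen_test_csv() {
  local n="$1" repeats="$2" out="$3"
  {
    echo "a"
    for _ in $(seq 1 "$repeats"); do
      seq 0 "$((n - 1))" | sed 's/^/A/'
    done
  } > "$out"
}
```

Calling `gen_test_csv 1000000 10 test.csv` should give a header plus 10,000,000 data lines, which agrees with the line count reported above.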
Owner

johnkerl commented May 28, 2022

Hi @jgarthur !!

Thanks for submitting this! :)

I think the test.csv example may be due in part to a "baseline RSS" rather than a leak issue ... there are three things I'm aware of: one is that Go executables are statically linked; another is that the entire Go runtime is present in that linkage; the third is that the (dense, not sparse) LR(1)-parser matrices take up quite a bit of memory. The first two issues are intrinsic to Go; the third is due to my use of GOCC. A "someday" project would be to try out GOGLL and see if that helps.

I think the question of leak-or-no-leak depends on the number of unique var1,var2 pairs: if there are only a few, this sounds very leaky; if there are many, it sounds like hash-map overhead from tracking the counts.
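
One way to tell which case applies is to count the distinct pairs without holding them all in an in-memory hash map. The sketch below is an assumption-laden helper, not part of Miller: `count_unique_pairs` is a hypothetical name, columns are selected by position rather than by header name, and it only works for simple CSV with no quoted fields containing commas or newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct (col1, col2) pairs in a CSV file.
# Assumes one header line and no quoted fields with embedded commas/newlines.
# col1 and col2 are 1-based column positions.
count_unique_pairs() {
  local file="$1" col1="$2" col2="$3"
  tail -n +2 "$file" \
    | cut -d, -f"$col1","$col2" \
    | LC_ALL=C sort -u \
    | wc -l \
    | tr -d ' '
}
```

For example, `count_unique_pairs big.csv 1 2` (with `big.csv` standing in for the 51GB file) prints the number of distinct pairs. A small number would point toward a leak; a number in the hundreds of millions would point toward per-key map overhead.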

I will try

mlr --csv head -n 100 then count -g a -o count test.csv
mlr --csv head -n 1000 then count -g a -o count test.csv
mlr --csv head -n 10000 then count -g a -o count test.csv
mlr --csv head -n 100000 then count -g a -o count test.csv
mlr --csv head -n 1000000 then count -g a -o count test.csv
...

etc., to get a sense of what's baseline RSS and what's data-dependent.
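
The sweep above could be scripted roughly as follows. This is a sketch that assumes GNU time is installed at `/usr/bin/time` (the `-f '%M'` and `-o` options are GNU-specific); `max_rss_kb` and `sweep_head_sizes` are hypothetical helper names, and the `mlr` invocation is copied from the commands listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and print its peak RSS in kilobytes.
# Assumes GNU time at /usr/bin/time. The command's stdout is discarded.
max_rss_kb() {
  local tmp
  tmp=$(mktemp)
  /usr/bin/time -f '%M' -o "$tmp" "$@" > /dev/null
  # Take the last line: GNU time prepends a notice if the command fails.
  tail -n 1 "$tmp"
  rm -f "$tmp"
}

# Print "n max_rss_kb" for each head size in the sweep.
sweep_head_sizes() {
  local file="$1" n
  for n in 100 1000 10000 100000 1000000; do
    printf '%s %s\n' "$n" \
      "$(max_rss_kb mlr --csv head -n "$n" then count -g a -o count "$file")"
  done
}
```

Running `sweep_head_sizes test.csv` prints one line per size. The figure at the smallest size approximates the baseline RSS, and the growth beyond it is the data-dependent part. The same helper can wrap gawk directly, as in `tail -n +2 test.csv | max_rss_kb gawk '{c[$1] += 1} END {for (x in c) print x "," c[x]}'`, so that the measurement covers gawk rather than an earlier pipeline stage.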

johnkerl removed the active label Sep 6, 2022
Owner

Related to #1119

johnkerl self-assigned this Nov 27, 2022
johnkerl removed the active label Jan 1, 2023
johnkerl changed the title from "count using much more memory than expected" to "mlr count using much more memory than expected" Feb 28, 2023
Owner

johnkerl commented Mar 6, 2023

I've done as much as I can on #1119; please re-open if this is still a blocking issue.

johnkerl closed this as completed Mar 6, 2023